Introduction

The psychological construct of self-regulation, and related concepts such as impulsivity, self-control, and inhibition, correlate with problematic real-world behaviors such as disordered eating (Mobbs et al., 2010; Verbeken et al., 2009; Nederkoorn et al., 2006a; Nederkoorn et al., 2006b), gambling (Lawrence et al., 2009; Fuentes et al., 2006; Alessi and Petry, 2003), drug addiction (Coffey et al., 2003; de Wit, Crean and John, 2000; Sher, Bartholow and Wood, 2000; Kirby, Petry and Bickel, 1999), and poor financial decisions (Meier and Sprenger, 2015; Meier and Sprenger, 2013; Meier and Sprenger, 2012). In these lines of work, measures of self-regulation are used as behavioral assays for individual difference analyses. An underlying assumption of treating behavioral measures as reflective of person-specific characteristics is that the measures of self-regulation are stable across time; in other words, that they have high test-retest reliability (high between-subject and low within-subject variability). This assumption has not been tested extensively or equally across all measures of self-regulation in the literature.
Self-regulation measures fall into two broad categories: surveys and cognitive tasks. While the former typically undergo some form of psychometric testing, often filtering out items with low test-retest reliability, this is less frequently true for the latter. Although empirical results on the test-retest reliabilities of certain cognitive task measures are reported in pockets of the literature, an exhaustive summary, or a systematic elimination of tasks or measures based on their stability across time, does not seem to exist. Test-retest reliability is, however, crucial for individual difference analyses (Hedge, Powell and Sumner, 2017).
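As a minimal illustration of this point, consider test-retest reliability as the correlation between two sessions of the same measure: the larger the within-subject noise relative to the stable between-subject trait variance, the lower the reliability. The sketch below uses simulated data (not data from the battery); all names and noise levels are hypothetical.

```r
# Simulated two-session scores: each session = stable trait + session noise.
# Illustration only; noise SDs are arbitrary choices.
set.seed(1)
n <- 100
true_score <- rnorm(n)                    # stable person-level trait
t1 <- true_score + rnorm(n, sd = 0.5)     # session 1
t2 <- true_score + rnorm(n, sd = 0.5)     # session 2
cor(t1, t2)                               # test-retest reliability (Pearson)

# Larger within-subject noise at retest lowers the reliability estimate
t2_noisy <- true_score + rnorm(n, sd = 2)
cor(t1, t2_noisy)
```

With equal noise at both sessions the expected correlation is trait variance over total variance, which is why noisy task measures can look unreliable even when the underlying trait is perfectly stable.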
To answer the question of the reliability of self-regulation measures comprehensively, we created a large battery consisting of both cognitive task and survey measures. In this paper we make two contributions: first, we provide a review of these measures along with their reported reliabilities, when available. Second, we report the results of a large study in which we collected retest reliability data for all of these measures. This effort not only fills a notable gap in the literature by clarifying which measures of self-regulation are stable across time, but also provides guidance on factors to attend to when constructing new tasks for individual difference measurement.

Methods

Data collection

This sample consists of data for 150 subjects from the original sample of 522 who completed the initial battery of 37 cognitive tasks, 23 surveys, and 3 demographic surveys. Details of the original sample, as well as the quality control (QC) procedures, are described elsewhere (Eisenberg et al., 2017). Invited participants were chosen randomly, and only subsets of them were invited for a given batch (instead of opening the battery to all qualified subjects), with the intention of avoiding potential oversampling of, and bias towards, "high self-regulators".

workers = read.csv(paste0(retest_data_path,'Local/User_717570_workers.csv'))
workers = workers %>% 
  group_by(Worker.ID) %>%
  mutate(Retest_worker=ifelse(sum(CURRENT.RetestWorker,CURRENT.RetestWorkerB2,CURRENT.RetestWorkerB3,CURRENT.RetestWorkerB4,CURRENT.RetestWorkerB5,na.rm=T)>0,1,0)) %>%
  ungroup()

worker_counts <- fromJSON(paste0(retest_data_path,'/Local/retest_worker_counts.json'))

worker_counts = as.data.frame(unlist(worker_counts))
names(worker_counts) = "task_count"

In total, 242 participants were invited, 175 began the battery, 157 completed it, and 150 provided data that passed QC at both time points. Our target sample size was determined in advance of data collection, and data collection continued until this number of participants with data surviving QC was reached.

disc_comp_date = read.csv(paste0(retest_data_path,'Local/discovery_completion_dates.csv'), header=FALSE)
val_comp_date = read.csv(paste0(retest_data_path,'Local/validation_completion_dates.csv'), header=FALSE)
test_comp_date = rbind(disc_comp_date, val_comp_date)
rm(disc_comp_date, val_comp_date)
retest_comp_date = read.csv(paste0(retest_data_path,'Local/retest_completion_dates.csv'), header=FALSE)
comp_dates = merge(retest_comp_date, test_comp_date, by="V1")
names(comp_dates) <- c("sub_id", "retest_comp", "test_comp")
comp_dates$retest_comp = as.Date(comp_dates$retest_comp)
comp_dates$test_comp = as.Date(comp_dates$test_comp)
comp_dates$days_btw = with(comp_dates, retest_comp-test_comp)

Retest data collection took place on average 115 days after the completion of the initial battery, with a range of 60 to 228 days.

rm(test_comp_date, retest_comp_date, comp_dates)

Demographics

test_demog <- read.csv(paste0(retest_data_path, '/t1_data/demographic_health.csv'))

retest_demog <- read.csv(paste0(retest_data_path, 'demographic_health.csv'))

retest_demog = retest_demog[retest_demog$X %in% test_demog$X,]

names(test_demog)[which(names(test_demog) == 'X')] <-'sub_id'
names(retest_demog)[which(names(retest_demog) == 'X')] <-'sub_id'

summary(test_demog %>%
          select(Sex, Age))
      Sex            Age      
 Min.   :0.00   Min.   :21.0  
 1st Qu.:0.00   1st Qu.:28.0  
 Median :1.00   Median :33.0  
 Mean   :0.52   Mean   :34.1  
 3rd Qu.:1.00   3rd Qu.:38.8  
 Max.   :1.00   Max.   :59.0  
summary(retest_demog %>%
          select(Sex, Age))
      Sex             Age      
 Min.   :0.000   Min.   :21.0  
 1st Qu.:0.000   1st Qu.:29.0  
 Median :1.000   Median :33.0  
 Mean   :0.527   Mean   :34.5  
 3rd Qu.:1.000   3rd Qu.:38.8  
 Max.   :1.000   Max.   :60.0  

Literature

One of the major contributions of this project is a comprehensive literature review of the retest reliabilities of the surveys and tasks that were used. We reviewed the literature at the measure level (as opposed to the task level), paying attention to differences in sample size, the delay between the two measurements, and the statistic used to assess reliability (e.g. Spearman vs. Pearson correlations). Here we present a table and a visualization summarizing our findings. References mentioned in the table below can be found here.
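The choice of statistic matters when comparing reported reliabilities across studies: a Pearson correlation is sensitive to extreme scores, while the rank-based Spearman correlation is not. The sketch below demonstrates this on simulated data (illustration only, not data from the review).

```r
# Two unrelated score vectors, then the same vectors with one extreme
# joint outlier appended. The outlier inflates Pearson far more than Spearman.
set.seed(2)
x <- rnorm(30)
y <- rnorm(30)
x_out <- c(x, 10)   # hypothetical extreme subject, high on both sessions
y_out <- c(y, 10)

cor(x, y, method = "pearson")            # near zero: scores are unrelated
cor(x_out, y_out, method = "pearson")    # pulled strongly upward by one point
cor(x_out, y_out, method = "spearman")   # rank-based, much less affected
```

This is one reason the review records which statistic each study reported rather than pooling them as if they were interchangeable.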

lit_review <- read.csv('/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/input/lit_review_figure.csv')

lit_review
lit_review = lit_review %>%
  separate(dv, c("task_group", "var"), sep="\\.",remove=FALSE,extra="merge") %>%
  mutate(task_group = factor(task_group, levels = task_group[order(task)]),
         type = as.character(type)) %>%
  mutate(task_group = gsub("_", " ", task_group),
         var = gsub("_", " ", var)) %>%
  arrange(task_group, raw_fit, var) %>%
  mutate(task_group = ifelse(task_group == "psychological refractory period two choices", "psychological refractory period", ifelse(task_group == "angling risk task always sunny", "angling risk task",task_group))) %>%
  mutate(task_group = gsub("survey", "", task_group)) %>%
  select(-measure_description)
Warning in `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels)
else paste0(labels, : duplicated levels in factors are deprecated
tmp = lit_review[duplicated(lit_review$reference)==FALSE,]

nrow(tmp)
[1] 154
sum(tmp$sample_size)
[1] 17550
nrow(lit_review)
[1] 583
rm(tmp)

Measure level plot

lit_review = lit_review %>%
  select(-reference)

Because the measure-level plot is difficult to digest, we summarize it at the task level to give a general sense of the main takeaways. This summary naturally disregards much of the fine-grained information.

p1_t_legend <- lit_review %>%
  filter(task == 'task') %>%
ggplot(aes(y = factor(task_group, levels=rev(unique(task_group))), x = retest_reliability)) + 
  geom_point(aes(size=sample_size, shape = type), color='black')+
  theme(axis.text.y = element_text(size=43),
        legend.position = 'bottom',
        axis.text.x = element_text(size=23),
        axis.title.x = element_text(size=30), 
        legend.text = element_text(size=20), 
        legend.key.width = unit(0.75, "inches"), 
        legend.title = element_text(size=28),
        legend.spacing.x = unit(0.5, "inches")) + 
  guides(size = guide_legend(override.aes = list(size=c(9,18,28))),
         shape = guide_legend(override.aes = list(size=18)))+
  xlab("Retest Reliability")+
  ylab("")+
  scale_x_continuous(limits = c(-0.25,1), breaks=c(-0.25, 0, 0.25, 0.5, 0.75, 1))+
  scale_y_discrete(labels = function(x) str_wrap(x, width = 10))+
  scale_shape_manual(breaks = sort(lit_review$type), values = c(15, 16, 17, 3), name="Type")+
  scale_size_continuous(name = "Sample Size")+
  geom_vline(xintercept = 0, color = "red", size = 1)

p1_t <- lit_review %>%
  filter(task == 'task') %>%
ggplot(aes(y = factor(task_group, levels=rev(unique(task_group))), x = retest_reliability)) + 
  geom_point(aes(size=sample_size, shape = type), color='#00BFC4')+
  theme(axis.text.y = element_text(size=43),
        legend.position = 'none',
        axis.text.x = element_text(size=23),
        axis.title.x = element_text(size=30)) + 
  xlab("Retest Reliability")+
  ylab("")+
  scale_x_continuous(limits = c(-0.25,1), breaks=c(-0.25, 0, 0.25, 0.5, 0.75, 1))+
  scale_y_discrete(labels = function(x) str_wrap(x, width = 10))+
  scale_shape_manual(breaks = sort(lit_review$type), values = c(15, 16, 17, 3))+
  scale_size_continuous(range = c(5, 35))+
  geom_vline(xintercept = 0, color = "red", size = 1)

p2_t <- lit_review %>%
  filter(task == 'survey') %>%
ggplot(aes(y = factor(task_group, levels=rev(unique(task_group))), x = retest_reliability)) + 
  geom_point(aes(size=sample_size, shape = type), color='#F8766D')+
  theme(axis.text.y = element_text(size=43),
        legend.position = 'none',
        axis.text.x = element_text(size=23),
        axis.title.x = element_text(size=30)) + 
  xlab("Retest Reliability")+
  ylab("")+
  scale_x_continuous(limits = c(-0.25,1), breaks=c(-0.25, 0, 0.25, 0.5, 0.75, 1))+
  scale_y_discrete(labels = function(x) str_wrap(x, width = 10))+
  scale_shape_manual(breaks = sort(lit_review$type), values = c(15, 16, 17, 3))+
  scale_size_continuous(range = c(5, 35))+
  geom_vline(xintercept = 0, color = "red", size = 1)

mylegend<-g_legend(p1_t_legend)

p3_t <- arrangeGrob(arrangeGrob(p1_t, p2_t, nrow=1), mylegend, nrow=2,heights=c(10, 1))

ggsave('Lit_Review_Plot_t.jpg', plot = p3_t, device = "jpeg", path = "/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/figures/", width = 27, height = 48, units = "in", limitsize = FALSE, dpi = 72)
rm(p1_t, p2_t, p3_t, mylegend, p1_t_legend)

An interactive version of this plot can be found here

The takeaways from this review are:
- Survey measures have been reported to have higher reliability than task measures
- Survey measures show less variability in reported reliability estimates than task measures

Loading datasets

The variables included in this report are:
- meaningful variables (including only HDDM parameters)
- EZ-diffusion parameters
- raw RT and accuracy measures
- variables found in the literature (for comparison)

Load time 1 data

test_data <- read.csv(paste0(retest_data_path,'t1_data/variables_exhaustive.csv'))

test_data <- test_data[,names(test_data) %in% retest_report_vars]

test_data$X <- as.character(test_data$X)
names(test_data)[which(names(test_data) == 'X')] <-'sub_id' 

For reference, here are the variables that are excluded from the analyses in the remainder of this report because they were not of theoretical interest in the factor structure analyses of these data so far. These include drift diffusion and other model parameters for specific conditions within a task; survey variables that are not part of the dependent variables for that survey in the literature; and demographics (these are saved for prediction analyses).

test_data2 <- read.csv(paste0(retest_data_path, 't1_data/variables_exhaustive.csv'))

df <- data.frame(names(test_data2)[which(names(test_data2) %in% names(test_data) == FALSE)])
names(df) = c('vars')

df

Load time 2 data

retest_data <- read.csv(paste0(retest_data_path,'variables_exhaustive.csv'))

retest_data <- retest_data[,names(retest_data) %in% retest_report_vars]

retest_data$X <- as.character(retest_data$X)
names(retest_data)[which(names(retest_data) == 'X')] <-'sub_id' 
retest_data = retest_data[retest_data$sub_id %in% test_data$sub_id,]

Replace HDDM parameters in t1 data

Since HDDM parameters depend on the sample on which they are fit, we refit the model on t1 data for the subjects who have t2 data. Here we replace the HDDM parameters in the current t1 dataset with these refitted values.
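The sample dependence arises because hierarchical models shrink each subject's estimate toward the group mean, so changing which subjects are in the sample changes every subject's estimate. The toy sketch below illustrates partial pooling with a fixed, arbitrary shrinkage weight; it is not the HDDM fitting code, where the weight would instead depend on the estimated variance components.

```r
# Hypothetical partial-pooling illustration: a subject's pooled estimate is a
# weighted average of their own mean and the group mean. The weight w = 0.3
# is an arbitrary choice for demonstration.
shrink <- function(subj_mean, group_mean, w = 0.3) {
  (1 - w) * subj_mean + w * group_mean
}

subj_means <- c(0.2, 0.5, 1.5)

# The same subject gets different estimates under different samples,
# because the group mean changes:
shrink(subj_means[1], mean(subj_means))        # group mean from all 3 subjects
shrink(subj_means[1], mean(subj_means[1:2]))   # group mean from a subsample
```

This is why the t1 parameters must be refit on exactly the retest subsample before comparing them to the t2 fits.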

hddm_refits <- read.csv(paste0(retest_data_path,'t1_data/hddm_refits_exhaustive.csv'))

hddm_refits = hddm_refits[,names(hddm_refits) %in% retest_report_vars]

hddm_refits$X <- as.character(hddm_refits$X)
names(hddm_refits)[which(names(hddm_refits) == 'X')] <-'sub_id' 

#For later comparison of whether fitting the DDM parameters on full or retest sample makes a big difference
test_data_full_sample_hddm <- test_data

#drop hddm columns from test_data
test_data = cbind(test_data$sub_id, test_data[,names(test_data) %in% names(hddm_refits) == FALSE])

#fix naming before merging
names(test_data)[which(names(test_data) == 'test_data$sub_id')] <-'sub_id'

#merge hddm refits to test data
test_data = merge(test_data, hddm_refits, by="sub_id")

Results

Data quality checks

Demographics reliability

Point estimates of reliability for the demographic variables.

numeric_cols = c()

for(i in 1:length(names(test_demog))){
  if(is.numeric(test_demog[,i])){
    numeric_cols <- c(numeric_cols, names(test_demog)[i])
  }
}

demog_rel_df <- data.frame(spearman = rep(NA, length(numeric_cols)),
                     icc = rep(NA, length(numeric_cols)),
                     pearson = rep(NA, length(numeric_cols)))

row.names(demog_rel_df) <- numeric_cols

for(i in 1:length(numeric_cols)){
  demog_rel_df[numeric_cols[i], 'spearman'] <- get_spearman(numeric_cols[i], t1_df = test_demog, t2_df = retest_demog) 
  demog_rel_df[numeric_cols[i], 'icc'] <- get_icc(numeric_cols[i], t1_df = test_demog, t2_df = retest_demog)
  demog_rel_df[numeric_cols[i], 'pearson'] <- get_pearson(numeric_cols[i], t1_df = test_demog, t2_df = retest_demog)
}

demog_rel_df = demog_rel_df %>%
  mutate(var = row.names(.)) %>%
  select(var, icc, spearman, pearson) %>%
  arrange(-icc)

demog_rel_df
sjt.df(demog_rel_df %>% mutate_if(is.numeric, funs(round(., 3))), describe=F, hide.progress = TRUE, show.rownames = FALSE, file = "/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/tables/demog_rel_table.doc")
var icc spearman pearson
HispanicLatino 1 1 1
Age 0.998 0.996 0.997
Sex 0.993 0.987 0.987
HouseholdIncome 0.988 0.955 0.977
ArrestedChargedLifeCount 0.958 0.829 0.92
HighestEducation 0.951 0.895 0.908
WeightPounds 0.951 0.936 0.907
ChildrenNumber 0.95 0.971 0.907
HowLongSmoked 0.949 0.877 0.904
RelationshipNumber 0.948 0.9 0.901
CigsPerDay 0.945 0.841 0.895
LifetimeSmoke100Cigs 0.944 0.894 0.894
AlcoholHowManyDrinksDay 0.936 0.871 0.881
MortgageDebt 0.93 0.833 0.87
RetirementPercentStocks 0.928 0.863 0.867
LongestRelationship 0.928 0.853 0.867
HowOftenFailedActivitiesDrinking 0.927 0.782 0.866
CoffeeCupsPerDay 0.926 0.876 0.863
RentOwn 0.926 0.862 0.862
HowOftenMemoryConcentrationProblemCannabis 0.925 0.594 0.863
SmokeEveryDay 0.916 0.802 0.846
HeightInches 0.907 0.888 0.83
DivorceCount 0.905 0.935 0.836
CannabisHowOften 0.903 0.856 0.823
EducationDebt 0.901 0.837 0.821
HowOftenDevotedTimeCannabis 0.898 0.814 0.814
MedicalProblemsDueToDrugUse 0.888 0.814 0.814
AlcoholHowOften 0.881 0.788 0.79
RelationshipStatus 0.87 0.788 0.77
HowSoonSmokeAfterWaking 0.868 0.787 0.767
RetirementAccount 0.86 0.756 0.756
AlcoholHowOften6Drinks 0.853 0.766 0.747
CreditCardDebt 0.846 0.808 0.733
RelativeFriendConcernedDrinking 0.844 0.71 0.733
CannabisPast6Months 0.84 0.725 0.725
TeaCupsPerDay 0.834 0.633 0.715
HowOftenCantStopDrinking 0.833 0.704 0.716
CaffienatedSodaCansPerDay 0.832 0.711 0.723
HowOftenGuiltRemorseDrinking 0.829 0.564 0.708
CannabisHoursStoned 0.825 0.698 0.703
DoctorVisitsLastMonth 0.815 0.351 0.701
HowOftenDrinkMorning 0.809 0.39 0.692
CannabisConsideredReduction 0.806 0.638 0.675
NeurologicalDiagnoses 0.795 0.66 0.66
HowOftenUnableRememberDrinking 0.786 0.61 0.651
FeelBadGuiltyDrugUse 0.782 0.642 0.642
TrafficAccidentsLifeCount 0.781 0.585 0.643
HowOftenCantStopCannabis 0.748 0.732 0.598
InjuredDrinking 0.734 0.626 0.592
CarDebt 0.716 0.698 0.559
DaysHalfLastMonth 0.715 0.552 0.559
Worthless 0.709 0.632 0.549
DaysPhysicalHealthFeelings 0.705 0.533 0.545
Depressed 0.689 0.561 0.527
EverythingIsEffort 0.682 0.553 0.517
Hopeless 0.678 0.579 0.516
OtherDrugs 0.658 0.492 0.492
RestlessFidgety 0.621 0.507 0.451
TrafficTicketsLastYearCount 0.621 0.438 0.45
Nervous 0.615 0.473 0.445
NeglectedFamilyDrugUse 0.488 0.342 0.342
DaysLostLastMonth 0.475 0.604 0.312
EngagedInIllegalActsToObtainDrugs 0.468 0.306 0.306
AbleToStopDrugs 0.458 0.306 0.306
BlackoutFlashbackDrugUse 0.424 0.272 0.272
HowOftenFailedActivitiesCannabis 0.423 0.636 0.378
WidthdrawalSymptoms 0.403 0.254 0.254
Last30DaysUsual 0.395 0.285 0.248
GamblingProblem 0.317 0.38 0.195
AbuseMoreThanOneDrugAtATime 0.294 0.173 0.173
HowOftenHazardousCannabis 0.057 0.369 0.029
CaffieneOtherSourcesDayMG 0 0.363 -0.033
SpouseParentsComplainDrugUse -0.033 -0.017 -0.017
OtherDebtAmount -0.083 0.005 -0.041
summary(demog_rel_df)
     var                 icc             spearman          pearson       
 Length:74          Min.   :-0.0831   Min.   :-0.0166   Min.   :-0.0406  
 Class :character   1st Qu.: 0.6793   1st Qu.: 0.5373   1st Qu.: 0.5161  
 Mode  :character   Median : 0.8323   Median : 0.7006   Median : 0.7156  
                    Mean   : 0.7417   Mean   : 0.6558   Mean   : 0.6446  
                    3rd Qu.: 0.9257   3rd Qu.: 0.8401   3rd Qu.: 0.8628  
                    Max.   : 1.0000   Max.   : 1.0000   Max.   : 1.0000  

Effect of delays between the two measurements

Due to our data collection strategy, we did not strictly control the delay between the two measurements, as standard psychometric studies of retest reliability might.

An individual difference measure should preferably remain stable regardless of the delay between measurements.

Since we only had two measurements, we could not directly test whether a measure becomes less reliable with longer delays between the two time points (the average number of days between measurements would be the same for all measures, since reliability is a measure-level metric while days between completion is a subject-level one).

Thus, regressing the average difference for each measure on the average delay between measurements would amount to regressing a vector of varying values on a vector of identical values, like a t-test asking whether the mean of the varying column (the average difference score) differs from the single value in the constant column (the average delay between the two time points). (This holds in theory; in practice, because for each measure there may be subjects for whom the dependent variable could not be calculated, there may be more than one unique value for the average delay.) This analysis is therefore not meaningful.
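The degeneracy can be seen directly: with a predictor that has no variance, `lm` cannot estimate a slope and returns `NA` for that coefficient. The values below are simulated placeholders, not the actual measure-level differences.

```r
# Hypothetical measure-level average differences, regressed on a delay
# predictor that is constant across measures (115 days for every measure).
set.seed(3)
avg_diff <- rnorm(20)          # one average difference score per measure
avg_delay <- rep(115, 20)      # identical average delay for every measure

coef(lm(avg_diff ~ avg_delay)) # slope is NA: predictor is collinear
                               # with the intercept, nothing to estimate
```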

Instead of using a summary metric like the retest reliability estimate for each measure, we can check whether the difference score distribution for each measure depends on the delay between the two measurements. Since the difference score distribution is at the subject level, we can check whether the order of subjects in this distribution depends on their order in the distribution of days between completing the two tests.

We make a data frame with the difference between the two scores for each measure for each subject. Since the scores for different measures are on different scales, the difference scores are normalized for comparability (i.e. demeaned and divided by the standard deviation of the difference score distribution). Note that the variance of the difference score distribution accounts for the variance at both time points by summing them. Normalization equates the mean of each difference score distribution to 0, which would mask any meaningful change between the two time points; however, the analysis here does not interpret the means of the difference score distributions but rather their relation to the days between completion. We check whether the variables show systematic differences between the two time points later.

Here we check whether the difference is larger the longer the delay.

numeric_cols = c()

for(i in 1:length(names(test_data))){
  if(is.numeric(test_data[,i]) & names(test_data)[i] %in% names(retest_data)){
    numeric_cols <- c(numeric_cols, names(test_data)[i])
  }
}

t1_2_difference = data.frame()

for(i in 1:length(numeric_cols)){
  tmp = match_t1_t2(numeric_cols[i],format='wide')
  tmp = tmp %>% 
  mutate(difference = scale(`2` - `1`))
  t1_2_difference = rbind(t1_2_difference, tmp)
}

t1_2_difference$difference = as.data.frame(t1_2_difference$difference)$V1

t1_2_difference = t1_2_difference %>% separate(dv, c("task", "dv2"), sep="\\.", remove=FALSE)

Add completion dates to this data frame.

retest_task_comp_times = read.csv(paste0(retest_data_path, 'Local/retest_task_completion_times.csv'))
test_task_comp_times = read.csv(paste0(retest_data_path, 'Local/test_task_completion_times.csv'))
task_comp_times = merge(retest_task_comp_times, test_task_comp_times, by=c('worker_id','task'))
rm(retest_task_comp_times, test_task_comp_times)
task_comp_times = task_comp_times %>%
  select(-X.x, -X.y) %>%
  mutate(finish_day.x = as.Date(finish_day.x),
         finish_day.y = as.Date(finish_day.y),
         days_btw = finish_day.x-finish_day.y) %>%
  rename(sub_id=worker_id)
t1_2_difference = merge(t1_2_difference, task_comp_times[,c('sub_id', 'task','days_btw')], by=c('sub_id', 'task'))

What does the distribution of differences look like? Below is the distribution of differences between the two time points for each measure.

t1_2_difference %>%
  ggplot(aes(difference, alpha=dv))+
  geom_histogram(position='identity')+
  theme(legend.position = 'none')

What do the difference score distributions look like with respect to the days between completion?

t1_2_difference %>%
  ggplot()+
  geom_smooth(aes(as.numeric(days_btw), abs(difference), group=factor(dv)), method='lm', se=FALSE)+
  geom_smooth(aes(as.numeric(days_btw), abs(difference)), method='lm', color = "black", se=FALSE)+
  theme(legend.title = element_blank())+
  xlab('Days between completion')+
  ylab('Absolute Scaled difference score')

To test whether the slope of the black line is significant, we run a mixed effects model with a fixed effect of days between completion, a random slope for each dependent variable depending on the days between, and a random intercept for each dependent variable.

Previously we used subjects as a random effect, but the days between the two time points depends on the subject ID. What varies randomly is which dependent variable's distribution of differences we examine in relation to the days between time points. So we changed the model to a fixed effect of days between, a random slope for each dependent variable (dependent variables can be differentially sensitive to the effect of days between), and a random intercept for each dependent variable.

A significant negative fixed effect suggests that, on average, the longer the delay the smaller the difference.

summary(lmerTest::lmer(abs(difference) ~ scale(days_btw)+(scale(days_btw) | dv), data=t1_2_difference)) 
Linear mixed model fit by REML t-tests use Satterthwaite approximations
  to degrees of freedom [lmerMod]
Formula: abs(difference) ~ scale(days_btw) + (scale(days_btw) | dv)
   Data: t1_2_difference

REML criterion at convergence: 132273

Scaled residuals: 
   Min     1Q Median     3Q    Max 
-1.127 -0.719 -0.260  0.402 15.807 

Random effects:
 Groups   Name            Variance Std.Dev. Corr
 dv       (Intercept)     0.003190 0.0565       
          scale(days_btw) 0.000149 0.0122   1.00
 Residual                 0.468366 0.6844       
Number of obs: 63454, groups:  dv, 446

Fixed effects:
                  Estimate Std. Error         df t value Pr(>|t|)    
(Intercept)        0.72218    0.00382  457.00000  189.05   <2e-16 ***
scale(days_btw)   -0.00985    0.00278 3236.00000   -3.55   0.0004 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
            (Intr)
scl(dys_bt) 0.146 

A single mixed effects model, however, gives no sense of the simple effects (for how many variables the effect of days between is significant, and in which direction). We can run the regression separately for each dependent variable to see whether all difference score distributions are affected the same way by the days between completion.

get_delay_effect = function(df){
  mod = lm(abs(difference) ~ scale(days_btw), data = df)
  out = data.frame(estimate=coef(summary(mod))["scale(days_btw)","Estimate"], pval=coef(summary(mod))["scale(days_btw)","Pr(>|t|)"])
  return(out)
}

sig_days_effect = t1_2_difference %>%
  group_by(dv) %>%
  do(get_delay_effect(.)) %>%
  filter(pval<0.05)

sig_days_effect

For visualization we previously summarized the difference scores per person by computing the average difference per task per subject and plotting it against the number of days between completion. This yields multiple points for each value on the x-axis, representing the average difference per task for each subject. But variables within a task can move in different directions (e.g. if the patient proportion increases, the discount rate decreases), so this does not seem like a good idea.

Instead, we now plot all the data grouped by dependent variable and colored by whether the difference score distribution for that variable has a significant slope when regressed on days between. The black line is the main effect from the large multilevel model (this is the same plot as above, colored by whether the difference score distribution for each dependent variable has a significant slope depending on days between).

t1_2_difference %>%
  mutate(sig_days_effect = ifelse(dv %in% sig_days_effect$dv, 1, 0))%>%
  arrange(-sig_days_effect) %>%
  ggplot()+
  stat_smooth (aes(as.numeric(days_btw), abs(difference), 
                  group=factor(dv, levels=unique(dv[order(sig_days_effect)]), ordered=TRUE), 
                  color=factor(sig_days_effect, levels = c(1,0))),geom="line", alpha=0.5, method='lm')+
  geom_smooth(aes(as.numeric(days_btw), abs(difference)), method='lm', color = "black", se=FALSE)+
  theme(legend.title = element_blank())+
  xlab('Days between completion')+
  ylab('Scaled difference score')+
  scale_color_discrete(breaks=c(0,1),
                       labels=c("NS", "Sig"))

ggsave('DaysBtwEffect.jpg', device = "jpeg", path = "/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/figures/", width = 6, height = 4, units = "in", limitsize = FALSE, dpi = 100)

Conclusion: the average difference between the scores of a measure, if anything, decreases with increasing delay. This fixed effect masks the simple effect for each dependent variable: when examined individually, only 10 variables show a dependence on days between completion, with 7 showing decreasing differences with longer delays and 3 showing increasing differences. This suggests that the lack of stricter control over the days between completion does not have a large effect on the stability of most measures.


Do the variables that show significant dependence on the delay between completion times have systematically lower or higher reliability? We plot the reliability point estimate against the size of the delay effect. There are very few variables to compare, but regardless the result does not look conclusive or concerning.

#Create df of point estimate reliabilities
rel_df <- data.frame(spearman = rep(NA, length(numeric_cols)),
                     icc = rep(NA, length(numeric_cols)),
                     pearson = rep(NA, length(numeric_cols)),
                     partial_eta_sq = rep(NA, length(numeric_cols)),
                     sem = rep(NA, length(numeric_cols)),
                     var_subs = rep(NA, length(numeric_cols)),
                     var_ind = rep(NA, length(numeric_cols)),
                     var_resid = rep(NA, length(numeric_cols)))

row.names(rel_df) <- numeric_cols

for(i in 1:length(numeric_cols)){
  rel_df[numeric_cols[i], 'spearman'] <- get_spearman(numeric_cols[i]) 
  rel_df[numeric_cols[i], 'icc'] <- get_icc(numeric_cols[i])
  rel_df[numeric_cols[i], 'pearson'] <- get_pearson(numeric_cols[i])
  rel_df[numeric_cols[i], 'partial_eta_sq'] <- get_partial_eta(numeric_cols[i])
  rel_df[numeric_cols[i], 'sem'] <- get_sem(numeric_cols[i])
  rel_df[numeric_cols[i], 'var_subs'] <- get_var_breakdown(numeric_cols[i])$subs
  rel_df[numeric_cols[i], 'var_ind'] <- get_var_breakdown(numeric_cols[i])$ind
  rel_df[numeric_cols[i], 'var_resid'] <- get_var_breakdown(numeric_cols[i])$resid
}

rel_df$dv = row.names(rel_df)
row.names(rel_df) = seq(1:nrow(rel_df))
rel_df$task = 'task'
rel_df[grep('survey', rel_df$dv), 'task'] = 'survey'
rel_df[grep('holt', rel_df$dv), 'task'] = "task"
rel_df = rel_df %>%
  select(dv, task, spearman, icc, pearson, partial_eta_sq, sem, var_subs, var_ind, var_resid)
# row.names(rel_df) = NULL
sig_days_effect %>%
  left_join(rel_df, by='dv') %>%
  ggplot(aes(estimate, icc))+
  geom_point()

Overlapping survey questions

Some surveys have overlapping questions. Do these correlate within and across sessions?

First determine the overlapping questions.

tmp = read.csv(gzfile(paste0(retest_data_path, 'items.csv.gz')))
tmp = tmp %>% 
  filter(worker == 's005') %>% 
  select(item_ID, item_text) %>% 
  mutate(item_text = trimws(as.character(item_text))) %>%
  unite(item, c("item_ID", "item_text"), sep = "___")

comb = as.data.frame(t(combn(unique(tmp$item),2)))

duplicate_items = comb %>% 
  filter(grepl('dospert', V1)==FALSE) %>%
  filter(grepl('selection_optimization', V1)==FALSE) %>%
  filter(grepl('sensation_seeking', V1)==FALSE) %>%
  separate(V1, c("item1_ID", "item1_text"), sep="___") %>%
  separate(V2, c("item2_ID", "item2_text"), sep="___") %>%
  mutate(similarity = levenshteinSim(item1_text, item2_text)) %>%
  filter(similarity>0.8) %>%
  select(item1_ID, item2_ID, item1_text, item2_text)

duplicate_items
#surveys to read in
extract_items = c('worker',unique(with(duplicate_items, c(item1_ID, item2_ID))))

#correlations to compute:
#item1_t1 - item2_t1, 
#item1_t2 - item2_t2, 
#item1_t1 - item2_t2, 
#item1_t2 - item2_t1

duplicate_items_data_t1 = read.csv(paste0(test_data_path, 'subject_x_items.csv'))
duplicate_items_data_t2 = read.csv(paste0(retest_data_path, 'subject_x_items.csv'))

duplicate_items_data_t1 = duplicate_items_data_t1 %>%
  filter(worker %in% duplicate_items_data_t2$worker) %>%
  select(extract_items)

duplicate_items_data_t2=duplicate_items_data_t2 %>%
  filter(worker %in% duplicate_items_data_t1$worker) %>%
  select(extract_items)

duplicate_items = duplicate_items %>%
  mutate(t1_t1_cor = NA,
         t2_t2_cor = NA,
         t1_t2_cor = NA,
         t2_t1_cor = NA,
         t1_t1_polycor = NA,
         t2_t2_polycor = NA,
         t1_t2_polycor = NA,
         t2_t1_polycor = NA)

for(i in 1:nrow(duplicate_items)){
  duplicate_items$t1_t1_cor[i] = abs(cor(duplicate_items_data_t1[,c(duplicate_items$item1_ID[i])],
                                  duplicate_items_data_t1[,c(duplicate_items$item2_ID[i])]))
  
  duplicate_items$t2_t2_cor[i] = abs(cor(duplicate_items_data_t2[,c(duplicate_items$item1_ID[i])],
                                  duplicate_items_data_t2[,c(duplicate_items$item2_ID[i])]))
  
  duplicate_items$t1_t2_cor[i] = abs(cor(duplicate_items_data_t1[,c(duplicate_items$item1_ID[i])],
                                  duplicate_items_data_t2[,c(duplicate_items$item2_ID[i])]))
  
  duplicate_items$t2_t1_cor[i] = abs(cor(duplicate_items_data_t2[,c(duplicate_items$item1_ID[i])],
                                  duplicate_items_data_t1[,c(duplicate_items$item2_ID[i])]))
  
  duplicate_items$t1_t1_polycor[i] = abs(polychoric(data.frame(duplicate_items_data_t1[,c(duplicate_items$item1_ID[i])],
                      duplicate_items_data_t1[,c(duplicate_items$item2_ID[i])]))$rho[2])

  duplicate_items$t2_t2_polycor[i] = abs(polychoric(data.frame(duplicate_items_data_t2[,c(duplicate_items$item1_ID[i])],
                      duplicate_items_data_t2[,c(duplicate_items$item2_ID[i])]))$rho[2])

  duplicate_items$t1_t2_polycor[i] = abs(polychoric(data.frame(duplicate_items_data_t1[,c(duplicate_items$item1_ID[i])],
                      duplicate_items_data_t2[,c(duplicate_items$item2_ID[i])]))$rho[2])

  duplicate_items$t2_t1_polycor[i] = abs(polychoric(data.frame(duplicate_items_data_t2[,c(duplicate_items$item1_ID[i])],
                      duplicate_items_data_t1[,c(duplicate_items$item2_ID[i])]))$rho[2])
}

duplicate_items
summary(duplicate_items$t1_t1_polycor)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.284   0.355   0.583   0.567   0.742   0.822 
summary(duplicate_items$t2_t2_polycor)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.101   0.446   0.612   0.580   0.700   0.904 
summary(duplicate_items$t1_t2_polycor)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0128  0.3570  0.5710  0.5160  0.6870  0.8380 
summary(duplicate_items$t2_t1_polycor)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0631  0.4490  0.5350  0.5010  0.5830  0.8190 

Comparison to prior literature

Read in and process bootstrapped results.

process_boot_df = function(df){
  df = df %>%
  drop_na() %>%
  mutate(dv = as.character(dv),
         icc = as.numeric(as.character(icc)),
         spearman = as.numeric(as.character(spearman)),
         pearson = as.numeric(as.character(pearson)),
         eta_sq = as.numeric(as.character(eta_sq)),
         sem = as.numeric(as.character(sem)),
         partial_eta_sq = as.numeric(as.character(partial_eta_sq)),
         omega_sq = as.numeric(as.character(omega_sq)),
         var_subs = as.numeric(as.character(var_subs)),
         var_ind = as.numeric(as.character(var_ind)),
         var_resid = as.numeric(as.character(var_resid)),
         F_time = as.numeric(as.character(F_time)),
         p_time = as.numeric(as.character(p_time)),
         df_time = as.numeric(as.character(df_time)),
         df_resid = as.numeric(as.character(df_resid)))
  return(df)} 

boot_df <- read.csv(gzfile(paste0(retest_data_path,'bootstrap_merged.csv.gz')))
Warning in scan(file = file, what = what, sep = sep, quote = quote, dec =
dec, : embedded nul(s) found in input
boot_df = process_boot_df(boot_df)

boot_df = boot_df[boot_df$dv %in% retest_report_vars,]

# Check if you have all variables bootstrapped
# retest_report_vars[which(retest_report_vars %in% boot_df$dv==FALSE)]

# Boot df contains hddm parameters fit on the full sample in the t1 data
# refits_bootstrap_merged.csv.gz contains bootstrapped reliabilities 

refit_boot_df = read.csv(gzfile(paste0(retest_data_path,'refits_bootstrap_merged.csv.gz')))
Warning in scan(file = file, what = what, sep = sep, quote = quote, dec =
dec, : embedded nul(s) found in input
refit_boot_df = process_boot_df(refit_boot_df)

fullfit_boot_df = boot_df[as.character(boot_df$dv) %in% unique(as.character(refit_boot_df$dv)),]

boot_df = boot_df[!as.character(boot_df$dv) %in% unique(as.character(refit_boot_df$dv)),]

boot_df = rbind(boot_df, refit_boot_df)

rm(refit_boot_df)

Summarize bootstrapped results and merge to lit review data

var_boot_df = boot_df %>%
  group_by(dv) %>%
  summarise(mean_icc = mean(icc),
            mean_pearson = mean(pearson))

rel_comp = lit_review %>%
  left_join(var_boot_df, by = 'dv')
Warning: Column `dv` joining factor and character vector, coercing into
character vector

Here’s what our data look like (583 data points for 171 measures):

rel_comp

Lit review results

Distribution of reliabilities, sample sizes and delays

rel_comp %>% 
  select(dv, task, retest_reliability, sample_size, days) %>%
  filter(days < 3600) %>%
  gather(key, value, -dv, -task) %>%
  ggplot(aes(value, fill=task))+
  geom_density(alpha=0.5, position='identity')+
  facet_wrap(~key, scales='free')+
  theme(legend.title = element_blank())

summary(rel_comp$sample_size[rel_comp$task == "survey"])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     11      38      73     126     153     791 
summary(rel_comp$sample_size[rel_comp$task == "task"])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   10.0    30.0    47.0    78.6    81.0  1120.0 

Studies in the literature that report retest reliability use smaller samples for task measures than for survey measures.

summary(lm(sample_size ~ task,rel_comp))

Call:
lm(formula = sample_size ~ task, data = rel_comp)

Residuals:
   Min     1Q Median     3Q    Max 
-115.5  -57.6  -33.6    7.4 1043.4 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   126.50       8.47   14.94  < 2e-16 ***
tasktask      -47.89      10.79   -4.44 0.000011 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 127 on 581 degrees of freedom
Multiple R-squared:  0.0328,    Adjusted R-squared:  0.0311 
F-statistic: 19.7 on 1 and 581 DF,  p-value: 0.0000108

What predicts retest reliability in the literature? Measure type (task vs. survey), sample size, and delay in days.

mod1 = lmer(retest_reliability ~ task + (1|dv), rel_comp)
mod2 = lmer(retest_reliability ~ task + sample_size + (1|dv), rel_comp)

anova(mod1, mod2)
refitting model(s) with ML (instead of REML)
mod3 = lmer(retest_reliability ~ task + sample_size + days+ (1|dv), rel_comp)

anova(mod2, mod3)
refitting model(s) with ML (instead of REML)
summary(mod2)
Linear mixed model fit by REML ['lmerMod']
Formula: retest_reliability ~ task + sample_size + (1 | dv)
   Data: rel_comp

REML criterion at convergence: -377.2

Scaled residuals: 
   Min     1Q Median     3Q    Max 
-3.190 -0.488  0.127  0.550  2.575 

Random effects:
 Groups   Name        Variance Std.Dev.
 dv       (Intercept) 0.0174   0.132   
 Residual             0.0206   0.143   
Number of obs: 583, groups:  dv, 171

Fixed effects:
              Estimate Std. Error t value
(Intercept)  0.7235198  0.0213512    33.9
tasktask    -0.1405121  0.0258197    -5.4
sample_size -0.0001256  0.0000543    -2.3

Correlation of Fixed Effects:
            (Intr) tsktsk
tasktask    -0.780       
sample_size -0.319  0.117

Tasks have significantly lower reliability and reliability decreases with increasing sample size.

rel_comp %>% 
  ggplot(aes(sample_size, retest_reliability, color=task))+
  geom_smooth(method='lm')+
  geom_point(alpha = 0.2)+
  theme(legend.title = element_blank())+
  xlab("Sample Size")+
  ylab("Retest reliability")

I used to compare the effect size of measure type on reliability estimates in the literature versus in our results (restricted to the variables found in the literature), but removed this analysis because:
1. the estimates in the literature are not all the same statistic;
2. from our data I was only using ICC’s;
3. I was sampling from our bootstrapped results as many data points for each measure as there are in the literature, which did not seem like the best way to use our data to make it comparable to the literature.

Despite these problems, the main result was that the effect size was much larger in our dataset than in the literature. Given the problems above, though, I think the sampling analyses below are more informative.

We also checked whether our results diverge most from studies with smaller sample sizes. The squared difference between our mean estimate and the reliability reported in the literature decreases exponentially with sample size: the smaller a study’s sample, the more its reliability estimate differs from ours. This was a weak result, however, because most studies in the literature have small samples, and these show both small and large deviations (the effects were not statistically significant either).

Direct relation to our results

Correlation between our mean estimates from bootstrapped samples and the literature review for task variables

n_df = rel_comp %>% 
  group_by(dv) %>%
  tally()

lit_emp_cor = function(){
  
  boot_comp = data.frame()
  
  for(i in 1:length(unique(rel_comp$dv)) ){
    cur_dv = unique(rel_comp$dv)[i]
    n = n_df$n[n_df$dv == cur_dv]
    sample_df = boot_df %>% filter(dv == cur_dv)
    tmp = sample_n(sample_df, n)
    boot_comp = rbind(boot_comp, tmp)
  }  
  
  rm(cur_dv, n, sample_df, tmp)
  
  #check if cbind is ok
  # sum(boot_comp$dv == rel_comp$dv)
  #cbinding pearson because that is the most common metric in the lit
  rel_comp = cbind(rel_comp, boot_comp$pearson)
  #rename new column
  names(rel_comp)[which(names(rel_comp) == "boot_comp$pearson")] = "pearson"
  
  out = data.frame(task = NA, survey = NA)
  
  out$task = with(rel_comp %>% filter(task == "task"), cor(pearson, retest_reliability))
  
  out$survey = with(rel_comp %>% filter(task == "survey"), cor(pearson, retest_reliability))
  
  rel_comp = rel_comp[,-16]
  
  return(out)
}

lit_emp_cor_out = plyr::rdply(100, lit_emp_cor)

write.csv(lit_emp_cor_out,'/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/tables/lit_emp_cor_out.csv')

summary(lit_emp_cor_out)
       .n             task           survey      
 Min.   :  1.0   Min.   :0.235   Min.   :0.0381  
 1st Qu.: 25.8   1st Qu.:0.263   1st Qu.:0.1162  
 Median : 50.5   Median :0.274   Median :0.1392  
 Mean   : 50.5   Mean   :0.274   Mean   :0.1381  
 3rd Qu.: 75.2   3rd Qu.:0.285   3rd Qu.:0.1582  
 Max.   :100.0   Max.   :0.323   Max.   :0.2478  

Noise ceiling

Model comparisons building a model to predict the reliabilities in the literature from a sample of our results versus a sample from the literature.

For each measure we sample one row out of the lit review and compare:

r^2 of retest_reliability ~ sampled_reliability vs. r^2 of retest_reliability ~ mean_pearson

coef of retest_reliability ~ sampled_reliability vs. coef of retest_reliability ~ mean_pearson

comp_lit_pred <- function(df){
  
  sample_from_dv <- function(df){
    if(nrow(df)>1){
      row_num = sample(1:nrow(df),1)
      sample_row = df[row_num,]
      df = df[-row_num,]
      df$lit_predictor = sample_row$retest_reliability
    }
    return(df)
  }
  
  sampled_df = df %>%
    group_by(dv) %>%
    do(sample_from_dv(.)) 
  
  mod_lit = lm(retest_reliability ~ lit_predictor+scale(sample_size)+task, data=sampled_df)
  mod_boot = lm(retest_reliability ~ mean_pearson+scale(sample_size)+task, data=sampled_df)
  
  out = data.frame(r2_lit = summary(mod_lit)$r.squared,
                   r2_boot = summary(mod_boot)$r.squared,
                   m_lit = coef(summary(mod_lit))["lit_predictor","Estimate"],
                   m_boot = coef(summary(mod_boot))["mean_pearson","Estimate"])
  
  return(out)
}

comp_lit_pred_out = plyr::rdply(1000, comp_lit_pred(rel_comp))
tmp = comp_lit_pred_out %>%
  select(-.n) %>%
  gather(key, value) %>%
  separate(key, c("stat", "sample"), sep = "_")
 
tmp$stat = as.factor(tmp$stat)
levels(tmp$stat) <- c("coefficient", expression(r^"2"))


tmp %>%  
  ggplot(aes(value, fill=sample))+
  geom_density(alpha = 0.5, position='identity', color=NA)+
  facet_grid(.~stat, scales='free', labeller = label_parsed)+
  scale_fill_discrete(breaks=c("boot","lit"),
                       labels=c("Empirical", "Literature"),
                      name="Prediction")+
  xlab('')+
  ylab('')

ggsave('Lit_Noise_Ceiling.jpg', device = "jpeg", path = "/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/figures/", width = 10, height = 4, units = "in", limitsize = FALSE, dpi = 100)
with(comp_lit_pred_out, t.test(r2_lit, r2_boot, paired=T))

    Paired t-test

data:  r2_lit and r2_boot
t = 29, df = 1000, p-value <2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.02467 0.02821
sample estimates:
mean of the differences 
                0.02644 
with(comp_lit_pred_out, t.test(m_lit, m_boot))

    Welch Two Sample t-test

data:  m_lit and m_boot
t = 18, df = 1300, p-value <2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.02791 0.03472
sample estimates:
mean of x mean of y 
   0.3334    0.3020 

Relationship between reliability metrics (point estimates)

Based on Weir (2005), ICC(3,k) does not take into account within-subject differences between the two time points (i.e. the fixed effect of time, or systematic error). Thus, it is well approximated by Pearson’s r and subject to similar criticisms. Weir (2005) suggests reporting at least this systematic error effect size if one chooses to report ICC(3,k). Based on his conclusions, here I report:
- ICC(3,k): As Dave clarified, this ranges from -1/(number of repeated measures - 1) to 1, so in our case this range is [-1, 1]. Larger values mean that the two scores of a subject for a given measure are more similar to each other than they are to the scores of other people; in this sense the “ICC is reflective of the ability of a test to differentiate between different individuals.”
- partial \(\eta^2\) for time (\(SS_{time}/SS_{within}\)): the effect size of time
- SEM (\(\sqrt{MS_{error}}\)): standard error of measurement; the smaller the better. It “quantifies the precision of individual scores on a test” and is not dependent on the sample in the way the ICC is, since (at least in this formulation) it does not depend on between-subject variability, but it is unit-dependent.
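These three quantities can be sketched on toy data. The following is a minimal illustration with simulated scores (the sample size, session effect, and noise level are all made up), not the study’s actual pipeline:

```r
# Toy sketch: ICC(3,k), partial eta^2 for time, and SEM from a
# subjects-x-sessions ANOVA decomposition, following Weir (2005).
set.seed(1)
n <- 50; k <- 2
true_score <- rnorm(n)                                   # stable individual differences
scores <- cbind(true_score + rnorm(n, sd = 0.5),         # session 1
                true_score + 0.2 + rnorm(n, sd = 0.5))   # session 2; 0.2 = practice effect

grand      <- mean(scores)
subj_means <- rowMeans(scores)
sess_means <- colMeans(scores)

SS_total <- sum((scores - grand)^2)
SS_subs  <- k * sum((subj_means - grand)^2)   # between-subjects
SS_time  <- n * sum((sess_means - grand)^2)   # systematic session effect
SS_error <- SS_total - SS_subs - SS_time

MS_subs  <- SS_subs / (n - 1)
MS_error <- SS_error / ((n - 1) * (k - 1))

icc3k          <- (MS_subs - MS_error) / MS_subs   # in [-1, 1] for k = 2
partial_eta_sq <- SS_time / (SS_time + SS_error)   # SS_time / SS_within
sem            <- sqrt(MS_error)                   # in the units of the measure
```

Note that removing the simulated practice effect shrinks partial \(\eta^2\) for time toward zero while leaving ICC(3,k) essentially unchanged, which is exactly why Weir (2005) recommends reporting it alongside the ICC.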

We calculated reliabilities for 74 survey measures and 372 cognitive task measures.

Though we are primarily reporting ICC’s as our metric of reliability, the results don’t change depending on the metric chosen. Here we plot point estimates of three different reliability metrics against each other (ICC, Pearson, Spearman). The bootstrapped version is essentially the same, but the plots are busier due to more data points.

table(rel_df$task)

survey   task 
    74    372 
p1 = rel_df %>%
  ggplot(aes(spearman, icc, col=task))+
  geom_point()+
  theme_bw()+
  theme(legend.title = element_blank(),
        legend.position = "none")+
  geom_abline(intercept = 0, slope=1)

p2 = rel_df %>%
  ggplot(aes(pearson, icc, col=task))+
  geom_point()+
  theme_bw()+
  theme(legend.title = element_blank(),
        legend.position = "none")+
  geom_abline(intercept = 0, slope=1)

p3 = rel_df %>%
  ggplot(aes(pearson, spearman, col=task))+
  geom_point()+
  theme_bw()+
  theme(legend.title = element_blank(),
        legend.position = "none")+
  geom_abline(intercept = 0, slope=1)

grid.arrange(p1, p2, p3, nrow=1)

ggsave('Metric_Scatterplots.jpg', plot = grid.arrange(p1, p2, p3, nrow=1), device = "jpeg", path = "/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/figures/", width = 12, height = 4, units = "in", limitsize = FALSE, dpi = 100)

As the scatter plots depict, the correlations between the different types of reliability metrics were very high.

cor(rel_df[,c('spearman', 'icc', 'pearson')])
         spearman    icc pearson
spearman   1.0000 0.9324  0.9577
icc        0.9324 1.0000  0.9800
pearson    0.9577 0.9800  1.0000

Note: Some variables have ICC’s < 0. This is the case when \(MS_{error}\) > \(MS_{between}\); for these variables the data show no relationship between the two time points.
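A toy illustration of this (with made-up data; the construction forcing scores to reverse across sessions is purely hypothetical): when the two sessions share no stable signal, \(MS_{error}\) exceeds \(MS_{between}\) and the ANOVA decomposition returns a negative ICC(3,k).

```r
# Toy illustration: when scores carry no stable signal across sessions,
# MS_error exceeds MS_between and ICC(3,k) comes out negative.
set.seed(2)
n <- 40; k <- 2
t1 <- rnorm(n)
t2 <- -t1 + rnorm(n, sd = 0.1)   # hypothetical: scores reverse between sessions
scores <- cbind(t1, t2)

grand      <- mean(scores)
subj_means <- rowMeans(scores)
sess_means <- colMeans(scores)

SS_total <- sum((scores - grand)^2)
SS_subs  <- k * sum((subj_means - grand)^2)
SS_time  <- n * sum((sess_means - grand)^2)
MS_subs  <- SS_subs / (n - 1)
MS_error <- (SS_total - SS_subs - SS_time) / ((n - 1) * (k - 1))

icc3k <- (MS_subs - MS_error) / MS_subs   # negative: MS_error > MS_subs
```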

Summary of all measure reliabilities

Summarized bootstrapped reliabilities

boot_df %>%
  group_by(dv) %>%
  summarise(icc_median = quantile(icc, probs = 0.5),
            icc_2.5 = quantile(icc, probs = 0.025),
            icc_97.5 = quantile(icc, probs = 0.975),
            spearman_median = quantile(spearman, probs = 0.5),
            spearman_2.5 = quantile(spearman, probs = 0.025),
            spearman_97.5 = quantile(spearman, probs = 0.975)) %>%
  datatable() %>%
  formatRound(columns=c('icc_median', 'icc_2.5', 'icc_97.5', 'spearman_median', 'spearman_2.5', 'spearman_97.5'), digits=3)
# Df wrangling for plotting
tmp = measure_labels %>%
  mutate(dv = as.character(dv)) %>%
  left_join(boot_df[,c("dv", "icc", "spearman")], by = 'dv') 

tmp = tmp %>%
  separate(dv, c("task_group", "var"), sep="\\.",remove=FALSE,extra="merge") %>%
  mutate(task_group = factor(task_group, levels = task_group[order(task)])) %>%
  separate(var, c("var"), sep="\\.",remove=TRUE,extra="drop") %>%
  mutate(task_group = gsub("_", " ", task_group),
         var = gsub("_", " ", var)) %>%
  arrange(task_group, var)

tmp = tmp %>%
  left_join(rel_df[,c("dv", "icc")], by = "dv") %>%
  rename(icc = icc.x, point_est = icc.y)

#Manual correction
tmp = tmp %>%
  mutate(task = ifelse(task_group == 'holt laury survey', "task", as.character(task))) %>%
  mutate(task_group = ifelse(task_group == "psychological refractory period two choices", "psychological refractory period", ifelse(task_group == "angling risk task always sunny", "angling risk task",task_group))) %>%
  mutate(task_group = gsub("survey", "", task_group))

Variable level summary of bootstrapped reliabilities.

p4 <- tmp %>%
  filter(task == 'task',
         raw_fit == 'raw') %>%
ggplot(aes(y = var, x = icc)) + 
  geom_point(color = '#00BFC4')+
  geom_point(aes(y = var, x = point_est), color = "black")+
  facet_grid(task_group~., switch = "y", scales = "free_y", space = "free_y", labeller = label_wrap_gen(width=20)) +
  theme(panel.spacing = unit(0.75, "lines"), 
        strip.placement = "outside",
        strip.text.y = element_text(angle=180, size=36),
        axis.text.y = element_text(size=20),
        panel.background = element_rect(fill = NA),
        panel.grid.major = element_line(colour = "grey80")) + 
  xlab("")+
  ylab("")+
  scale_x_continuous(limits = c(-0.25,1), breaks=c(-0.25, 0, 0.25, 0.5, 0.75, 1))+
  scale_y_discrete(labels = function(x) str_wrap(x, width = 10))+
  geom_vline(xintercept = 0, color = "red", size = 1)

p5 <- tmp %>%
  filter(task == 'survey') %>%
ggplot(aes(y = var, x = icc)) + 
  geom_point(color = '#F8766D')+
  geom_point(aes(y = var, x = point_est), color = "black")+
  facet_grid(task_group~., switch = "y", scales = "free_y", space = "free_y", labeller = label_wrap_gen(width=20)) +
  theme(panel.spacing = unit(0.75, "lines"), 
        strip.placement = "outside",
        strip.text.y = element_text(angle=180, size=36),
        axis.text.y = element_text(size=20),
        panel.background = element_rect(fill = NA),
        panel.grid.major = element_line(colour = "grey80")) + 
  xlab("")+
  ylab("")+
  scale_x_continuous(limits = c(-0.25,1), breaks=c(-0.25, 0, 0.25, 0.5, 0.75, 1))+
  scale_y_discrete(labels = function(x) str_wrap(x, width = 10))+
  geom_vline(xintercept = 0, color = "red", size = 1)
  
p6 <- arrangeGrob(p4, p5,nrow=1)

ggsave('Bootstrap_Raw_Var_Plot.jpg', plot = p6, device = "jpeg", path = "/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/figures/", width = 36, height = 72, units = "in", limitsize = FALSE, dpi=50)

rm(p4, p5, p6)

p4_t <- tmp %>%
  filter(task == 'task') %>%
ggplot(aes(x = factor(task_group, levels=rev(unique(task_group))), y = icc)) + 
  geom_violin(fill='#00BFC4')+
  theme_bw() + 
  theme(axis.text.y = element_text(size=30))+
  xlab("")+
  ylab("")+
  scale_y_continuous(limits = c(-0.25,1), breaks=c(-0.25, 0, 0.25, 0.5, 0.75, 1))+
  scale_x_discrete(labels = function(x) str_wrap(x, width = 15))+
  geom_hline(yintercept = 0, color = "red", size = 1)+
  coord_flip()

p5_t <- tmp %>%
  filter(task == 'survey') %>%
ggplot(aes(x = factor(task_group, levels=rev(unique(task_group))), y = icc)) + 
  geom_violin(fill='#F8766D')+
  theme_bw() + 
  theme(axis.text.y = element_text(size=43))+
  xlab("")+
  ylab("")+
  scale_y_continuous(limits = c(-0.25,1), breaks=c(-0.25, 0, 0.25, 0.5, 0.75, 1))+
  scale_x_discrete(labels = function(x) str_wrap(x, width = 10))+
  geom_hline(yintercept = 0, color = "red", size = 1)+
  coord_flip()
  
p6_t <- arrangeGrob(p4_t, p5_t,nrow=1)

ggsave('Bootstrap_Poster_Plot_t.jpg', plot = p6_t, device = "jpeg", path = "../output/figures/", width = 27, height = 48, units = "in", limitsize = FALSE, dpi = 100)

Example of the overlaying procedure.

p1<- tmp %>%
  filter(grepl('selection_optimization', dv)) %>%
ggplot(aes(x = var, y = icc)) + 
  geom_violin(fill='#F8766D')+
  theme_bw() + 
  theme(axis.text.y = element_text(size=30))+
  xlab("")+
  ylab("")+
  scale_y_continuous(limits = c(-0.25,1), breaks=c(-0.25, 0, 0.25, 0.5, 0.75, 1))+
  scale_x_discrete(labels = function(x) str_wrap(x, width = 15))+
  geom_hline(yintercept = 0, color = "red", size = 1)+
  coord_flip()

p2<- tmp %>%
  filter(grepl('selection_optimization', dv)) %>%
ggplot(aes(x = factor(task_group, levels=rev(unique(task_group))), y = icc, group=dv)) + 
  geom_violin(fill='#F8766D', position = 'identity')+
  theme_bw() + 
  theme(axis.text.y = element_text(size=30))+
  xlab("")+
  ylab("")+
  scale_y_continuous(limits = c(-0.25,1), breaks=c(-0.25, 0, 0.25, 0.5, 0.75, 1))+
  scale_x_discrete(labels = function(x) str_wrap(x, width = 15))+
  geom_hline(yintercept = 0, color = "red", size = 1)+
  coord_flip()

p3<- tmp %>%
  filter(grepl('selection_optimization', dv)) %>%
ggplot(aes(x = factor(task_group, levels=rev(unique(task_group))), y = icc)) + 
  geom_violin(fill='#F8766D', position = 'identity')+
  theme_bw() + 
  theme(axis.text.y = element_text(size=30))+
  xlab("")+
  ylab("")+
  scale_y_continuous(limits = c(-0.25,1), breaks=c(-0.25, 0, 0.25, 0.5, 0.75, 1))+
  scale_x_discrete(labels = function(x) str_wrap(x, width = 15))+
  geom_hline(yintercept = 0, color = "red", size = 1)+
  coord_flip()

p4 <- arrangeGrob(p1,p2, p3,nrow=1)

ggsave('Bootstrap_Example_Plot_t.jpg', plot = p4, device = "jpeg", path = "../output/figures/", width = 41, height = 5, units = "in", limitsize = FALSE, dpi = 100)

rm(p1, p2, p3, p4)

Survey vs Tasks

Comparison of survey measures to cognitive task measures in the bootstrapped results. Multilevel model with random intercepts for each measure and fixed effect of survey versus cognitive measure.

boot_df = boot_df %>%
    mutate(task = ifelse(grepl("survey",dv), "survey","task"),
           task = ifelse(grepl("holt",dv), "task", task))

boot_df %>%
  group_by(task) %>%
  summarise(icc_median = quantile(icc, probs = 0.5),
            icc_2.5 = quantile(icc, probs = 0.025),
            icc_97.5 = quantile(icc, probs = 0.975),
            spearman_median = quantile(spearman, probs = 0.5),
            spearman_2.5 = quantile(spearman, probs = 0.025),
            spearman_97.5 = quantile(spearman, probs = 0.975),
            num_vars = n()/1000) %>%
  datatable() %>%
  formatRound(columns=c('icc_median', 'icc_2.5', 'icc_97.5', 'spearman_median', 'spearman_2.5', 'spearman_97.5'), digits=3)
summary(lmerTest::lmer(icc ~  task + (1|dv), boot_df))
Linear mixed model fit by REML t-tests use Satterthwaite approximations
  to degrees of freedom [lmerMod]
Formula: icc ~ task + (1 | dv)
   Data: boot_df

REML criterion at convergence: -942508

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-13.201  -0.401   0.024   0.459   7.051 

Random effects:
 Groups   Name        Variance Std.Dev.
 dv       (Intercept) 0.05354  0.2314  
 Residual             0.00701  0.0837  
Number of obs: 446000, groups:  dv, 446

Fixed effects:
            Estimate Std. Error       df t value Pr(>|t|)    
(Intercept)   0.8725     0.0269 449.0000    32.4   <2e-16 ***
tasktask     -0.3903     0.0295 449.0000   -13.2   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
         (Intr)
tasktask -0.913

Variance breakdown

The quantitative explanation for the difference in reliability estimates between surveys and tasks, as recently detailed by Hedge et al. (2017), lies in the different sources of variance underlying these measures. Specifically, the ICC is calculated as the ratio of between-subjects variance to the total variance from all sources. Thus, measures with high between-subjects variance have high test-retest reliability. Intuitively, measures with high between-subjects variance are also better suited for individual difference analyses, as they capture the differences between the subjects in a sample.
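As a minimal numeric sketch of this ratio (the variance-component values below are assumed for illustration, not fitted estimates):

```r
# Assumed variance components (illustrative values, not fitted estimates)
var_subs  <- 0.8   # between subjects: stable individual differences
var_ind   <- 0.1   # between sessions: systematic time effect
var_resid <- 0.1   # residual error

icc <- var_subs / (var_subs + var_ind + var_resid)                  # = 0.8
var_subs_pct <- var_subs / (var_subs + var_ind + var_resid) * 100   # = 80
```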

Here we first plot the percentage of variance explained by each of the three sources for the point estimates of measure reliabilities. The plot includes only raw measures (no DDM parameters), and the measures are ranked by the percentage of between-subject variability for each task/survey (i.e. from best to worst individual difference measure within each task/survey). Then we test statistically whether the percentage of variance explained by these sources differs between tasks and surveys.

tmp = rel_df %>%
  mutate(var_subs_pct = var_subs/(var_subs+var_ind+var_resid)*100,
         var_ind_pct = var_ind/(var_subs+var_ind+var_resid)*100, 
         var_resid_pct = var_resid/(var_subs+var_ind+var_resid)*100) %>%
  select(dv, task, var_subs_pct, var_ind_pct, var_resid_pct) %>%
  mutate(dv = factor(dv, levels = dv[order(task)])) %>%
  separate(dv, c("task_group", "var"), sep="\\.",remove=FALSE,extra="merge") %>%
  mutate(task_group = factor(task_group, levels = task_group[order(task)])) %>%
  arrange(task_group, var_subs_pct) %>%
  mutate(rank = row_number()) %>%
  arrange(task, task_group, rank) %>%
  gather(key, value, -dv, -task_group, -var, -task, -rank) %>%
  ungroup()%>%
  mutate(task_group = gsub("_", " ", task_group),
         var = gsub("_", " ", var)) %>%
  mutate(task_group = ifelse(task_group == "psychological refractory period two choices", "psychological refractory period", ifelse(task_group == "angling risk task always sunny", "angling risk task",task_group))) %>%
  mutate(task_group = gsub("survey", "", task_group)) %>%
  filter(task=="task",
         !grepl("EZ|hddm", dv))%>%
  arrange(task_group, rank)
labels = tmp %>%
  distinct(dv, .keep_all=T)

p1 <- tmp %>%
  ggplot(aes(x=factor(rank), y=value, fill=factor(key, levels = c("var_resid_pct", "var_ind_pct", "var_subs_pct"))))+
  geom_bar(stat='identity', alpha = 0.75, color='#00BFC4')+
  scale_x_discrete(breaks = labels$rank,
                       labels = labels$var)+
  coord_flip()+
  facet_grid(task_group~., switch = "y", scales = "free_y", space = "free_y") +
  theme(panel.spacing = unit(0.5, "lines"), 
        strip.placement = "outside",
        strip.text.y = element_text(angle=180),
        panel.background = element_rect(fill = NA),
        panel.grid.major = element_line(colour = "grey85"),
        legend.position = 'bottom')+
  theme(legend.title = element_blank())+
  scale_fill_manual(breaks = c("var_subs_pct", "var_ind_pct", "var_resid_pct"),
                      labels = c("Variance between individuals",
                                 "Variance between sessions",
                                 "Error variance"),
                  values=c("grey65", "grey45", "grey25"))+
  ylab("")+
  xlab("")

tmp = rel_df %>%
  mutate(var_subs_pct = var_subs/(var_subs+var_ind+var_resid)*100,
         var_ind_pct = var_ind/(var_subs+var_ind+var_resid)*100, 
         var_resid_pct = var_resid/(var_subs+var_ind+var_resid)*100) %>%
  select(dv, task, var_subs_pct, var_ind_pct, var_resid_pct) %>%
  mutate(dv = factor(dv, levels = dv[order(task)])) %>%
  separate(dv, c("task_group", "var"), sep="\\.",remove=FALSE,extra="merge") %>%
  mutate(task_group = factor(task_group, levels = task_group[order(task)])) %>%
  arrange(task_group, var_subs_pct) %>%
  mutate(rank = row_number()) %>%
  arrange(task, task_group, rank) %>%
  gather(key, value, -dv, -task_group, -var, -task, -rank) %>%
  ungroup()%>%
  mutate(task_group = gsub("_", " ", task_group),
         var = gsub("_", " ", var)) %>%
  mutate(task_group = ifelse(task_group == "psychological refractory period two choices", "psychological refractory period", ifelse(task_group == "angling risk task always sunny", "angling risk task",task_group))) %>%
  mutate(task_group = gsub("survey", "", task_group)) %>%
  filter(task=="survey")%>%
  arrange(task_group, rank)
labels = tmp %>%
  distinct(dv, .keep_all=T)

p2 <- tmp %>%
  ggplot(aes(x=factor(rank), y=value, fill=factor(key, levels = c("var_resid_pct", "var_ind_pct", "var_subs_pct"))))+
  geom_bar(stat='identity', alpha = 0.75)+
  geom_bar(stat='identity', color='#F8766D', show.legend=FALSE)+
  scale_x_discrete(breaks = labels$rank,
                       labels = labels$var)+
  coord_flip()+
  facet_grid(task_group~., switch = "y", scales = "free_y", space = "free_y") +
  theme(panel.spacing = unit(0.5, "lines"), 
        strip.placement = "outside",
        strip.text.y = element_text(angle=180),
        panel.background = element_rect(fill = NA),
        panel.grid.major = element_line(colour = "grey85"),
        legend.position = 'bottom')+
  theme(legend.title = element_blank())+
  scale_fill_manual(breaks = c("var_subs_pct", "var_ind_pct", "var_resid_pct"),
                      labels = c("Variance between individuals",
                                 "Variance between sessions",
                                 "Error variance"),
                  values=c("grey65", "grey45", "grey25"))+
  ylab("")+
  xlab("")

mylegend<-g_legend(p2)

p3 <- arrangeGrob(arrangeGrob(p1 +theme(legend.position="none"),
                         p2 + theme(legend.position="none"),
                         nrow=1),
             mylegend, nrow=2,heights=c(10, 1))

ggsave('Variance_Breakdown_Plot.jpg', plot = p3, device = "jpeg", path = "/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/figures", width = 24, height = 20, units = "in", dpi = 100)
rm(tmp, labels, p1, p2 , p3)

Comparing types of variance for survey vs task measures: Survey measures have higher between subject variability

Note: This analysis includes DDM variables too.

We ran separate models for the different sources of variance because an interaction model (variance type x task) seemed too complicated.

First we find that task measures have a smaller percentage of their overall variance explained by variability between subjects compared to survey measures.

tmp = boot_df %>%
  mutate(var_subs_pct = var_subs/(var_subs+var_ind+var_resid)*100,
         var_ind_pct = var_ind/(var_subs+var_ind+var_resid)*100, 
         var_resid_pct = var_resid/(var_subs+var_ind+var_resid)*100) %>%
  select(dv, task, var_subs_pct, var_ind_pct, var_resid_pct) 

summary(lmerTest::lmer(var_subs_pct~task+(1|dv),tmp%>%select(-var_ind_pct,-var_resid_pct)))
Linear mixed model fit by REML t-tests use Satterthwaite approximations
  to degrees of freedom [lmerMod]
Formula: var_subs_pct ~ task + (1 | dv)
   Data: tmp %>% select(-var_ind_pct, -var_resid_pct)

REML criterion at convergence: 3304624

Scaled residuals: 
   Min     1Q Median     3Q    Max 
-4.316 -0.621  0.103  0.647  5.381 

Random effects:
 Groups   Name        Variance Std.Dev.
 dv       (Intercept) 188      13.7    
 Residual              96       9.8    
Number of obs: 446000, groups:  dv, 446

Fixed effects:
            Estimate Std. Error     df t value Pr(>|t|)    
(Intercept)    80.33       1.60 443.00    50.4   <2e-16 ***
tasktask      -28.81       1.75 443.00   -16.5   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
         (Intr)
tasktask -0.913

We also find that task measures have a significantly larger percentage of their variance explained by between-session variability. Larger between-session variability suggests systematic differences between the two sessions; such systematic effects can be due to, e.g., learning effects, as explored later.

summary(lmerTest::lmer(var_ind_pct~task+(1|dv),tmp%>%select(-var_subs_pct,-var_resid_pct)))
Linear mixed model fit by REML t-tests use Satterthwaite approximations
  to degrees of freedom [lmerMod]
Formula: var_ind_pct ~ task + (1 | dv)
   Data: tmp %>% select(-var_subs_pct, -var_resid_pct)

REML criterion at convergence: 3611691

Scaled residuals: 
   Min     1Q Median     3Q    Max 
-4.412 -0.651 -0.139  0.584  4.736 

Random effects:
 Groups   Name        Variance Std.Dev.
 dv       (Intercept) 239      15.5    
 Residual             191      13.8    
Number of obs: 446000, groups:  dv, 446

Fixed effects:
            Estimate Std. Error     df t value Pr(>|t|)    
(Intercept)     9.84       1.80 444.00    5.47  7.4e-08 ***
tasktask       14.36       1.97 444.00    7.29  1.4e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
         (Intr)
tasktask -0.913
summary(lmerTest::lmer(var_resid_pct~task+(1|dv),tmp%>%select(-var_subs_pct,-var_ind_pct)))
Linear mixed model fit by REML t-tests use Satterthwaite approximations
  to degrees of freedom [lmerMod]
Formula: var_resid_pct ~ task + (1 | dv)
   Data: tmp %>% select(-var_subs_pct, -var_ind_pct)

REML criterion at convergence: 2777592

Scaled residuals: 
   Min     1Q Median     3Q    Max 
-6.323 -0.475  0.018  0.530  6.343 

Random effects:
 Groups   Name        Variance Std.Dev.
 dv       (Intercept) 75.2     8.67    
 Residual             29.4     5.43    
Number of obs: 446000, groups:  dv, 446

Fixed effects:
            Estimate Std. Error     df t value Pr(>|t|)    
(Intercept)     9.83       1.01 446.00    9.75   <2e-16 ***
tasktask       14.45       1.10 446.00   13.09   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
         (Intr)
tasktask -0.913
tmp_save = tmp%>%
  gather(key, value, -dv, -task) %>%
  group_by(task, key) %>%
  summarise(median = median(value),
            sd = sd(value)) %>%
  mutate(key = ifelse(key == 'var_ind_pct', 'Between session variance', ifelse(key == 'var_subs_pct', 'Between subjects variance', ifelse(key == 'var_resid_pct', 'Residual variance',NA)))) %>%
  rename(Median = median, SD = sd) %>%
  arrange(task, key)

tmp_save
sjt.df(tmp_save %>% mutate_if(is.numeric, funs(round(., 3))), describe=F, hide.progress = TRUE, show.rownames = FALSE, file = "/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/tables/var_breakdown_summary.doc")
task key Median SD
survey Between session variance 5.384 11.353
survey Between subjects variance 83.355 11.686
survey Residual variance 8.976 4.325
task Between session variance 18.077 22.114
task Between subjects variance 52.548 17.681
task Residual variance 22.818 11.012

Summarizing for clearer presentation. This graph currently uses the bootstrapped reliabilities and is therefore noisier than it would be if based only on the point estimates.

tmp %>%
  gather(key, value, -dv, -task) %>%
  group_by(task, key) %>%
  summarise(mean_pct = mean(value),
            sd_pct = sd(value),
            n = n()) %>%
  mutate(cvl = qt(0.025, n-1),
         cvu = qt(0.975, n-1),
         cil = mean_pct+(sd_pct*cvl)/sqrt(n),
         ciu = mean_pct+(sd_pct*cvu)/sqrt(n),
         sem_pct = sd_pct/sqrt(n)) %>%
  ggplot(aes(factor(key, levels = c("var_subs_pct", "var_ind_pct", "var_resid_pct"),
                    labels = c("Variance between individuals",
                               "Variance between sessions",
                               "Error variance")), mean_pct, color=task))+
  geom_point(position=position_dodge(width = 0.25), size = 4)+
  # geom_errorbar(aes(ymin=mean_pct-sd_pct, ymax=mean_pct+sd_pct), position=position_dodge(width = 0.25), width=0, size=2)+
  geom_errorbar(aes(ymin=cil, ymax=ciu), position=position_dodge(width = 0.25), width=0.25, size=2)+
  theme_bw()+
  xlab('')+
  ylab('Percent')+
  theme(legend.title = element_blank(),
        legend.text = element_text(size=14),
        legend.position = 'bottom',
        axis.text = element_text(size=12),
        axis.title.y = element_text(size=12))+
  scale_x_discrete(labels = function(x) str_wrap(x, width = 15))

ggsave('Variance_Breakdown_DotPlot.jpg', device = "jpeg", path = "/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/figures", width = 6, height = 5, units = "in", dpi = 100)

Systematic effects between time points

Anova time effects

The type of ICC we have chosen does not take within-subject (between-session/systematic) variance into account. This is why Weir recommends checking whether there is a significant change based on time and examining the SEMs. These systematic effects could be meaningful and important to account for in some measures (e.g. task measures that show learning effects).

Had we chosen another kind of ICC that takes this source of variance into account (e.g. ICC(2,1) or ICC(2,k)), the estimates could have suggested that tasks have even lower reliability.
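To illustrate the difference between ICC flavors, here is a minimal sketch using the `psych` package on toy two-session data (the data frame `scores` and its columns are hypothetical, not from our pipeline):

```r
# Toy comparison of ICC(3,1) vs ICC(2,1) when sessions differ systematically.
library(psych)

set.seed(1)
scores = data.frame(t1 = rnorm(100))
scores$t2 = scores$t1 + 0.5 + rnorm(100, sd = 0.3)  # constant shift between sessions

res = ICC(scores)
# ICC3 (= ICC(3,1)) treats the session effect as fixed and ignores the shift;
# ICC2 (= ICC(2,1)) counts the shift as error and therefore comes out lower.
res$results[res$results$type %in% c("ICC2", "ICC3"), c("type", "ICC")]
```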

A simple t-test on the difference scores alone would not be a very rigorous way of testing whether any change is meaningful, because two error-laden distributions from the two time points would be compared to each other. Fortunately, there are ways to take the measurement error at both time points into account.

To check whether a measure shows systematic differences between the two time points in a meaningful number of bootstrapped samples, we can test whether the effect of time is significant in each bootstrapped sample and flag variables for which more than 5% of the bootstrapped samples show significant time effects.

Another way might be to compute confidence intervals using SEMs as described in the second half of Weir (2005) and check what percent of participants have scores that fall outside this range. I haven’t pursued this for now.

Here we ask: Which variables have significant time effects in more than 5% of the bootstrapped samples?

23/74 survey measures; 133/372 task measures.

boot_df %>%
  select(dv, p_time, task) %>%
  mutate(time_effect_sig = ifelse(p_time<0.05,1,0)) %>%
  group_by(dv)%>%
  summarise(pct_sig_time_effect = sum(time_effect_sig)/10,
            task = unique(task))%>%
  filter(pct_sig_time_effect>5) %>%
  arrange(task,-pct_sig_time_effect) %>%
  ungroup()%>%
  group_by(task) %>%
  summarise(count=n())

Are these significantly different between surveys and tasks? No.

chisq.test(matrix(data = c(23,74-23,133, 372-133), nrow=2))

    Pearson's Chi-squared test with Yates' continuity correction

data:  matrix(data = c(23, 74 - 23, 133, 372 - 133), nrow = 2)
X-squared = 0.4, df = 1, p-value = 0.5

Are they predominantly in one or the other direction and does that differ depending on survey vs task?

boot_df %>%
  select(dv, p_time, task, pearson) %>%
  mutate(time_effect_sig = ifelse(p_time<0.05,1,0)) %>%
  group_by(dv)%>%
  summarise(pct_sig_time_effect = sum(time_effect_sig)/10,
            task = unique(task),
            mean_pearson = mean(pearson))%>%
  filter(pct_sig_time_effect>5) %>%
  arrange(task,-pct_sig_time_effect) %>%
  arrange(mean_pearson) 

SEM CI calculation

Step 1: T = grand mean + ICC * (subject score - grand mean)
Step 2: SEP = SD of both measurements * sqrt(1 - ICC^2)

Here I calculate, for each measure, the proportion of subjects whose t2 score falls outside a 95% confidence interval around the true score predicted from t1 (using the standard error of prediction, SEP).

This is odd to think about because the larger the ICC of a measure the smaller the SEP. So very small differences between the two time points can be categorized as ‘meaningful’ based on the tiny SEP.
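This inverse relationship can be seen directly from the SEP formula (toy numbers, not study data):

```r
# SEP = SD * sqrt(1 - ICC^2): as reliability rises, the prediction
# interval tightens, so ever smaller t1-to-t2 changes count as 'meaningful'.
sd_both = 10
for (icc in c(0.5, 0.8, 0.95)) {
  sep = sd_both * sqrt(1 - icc^2)
  cat(sprintf("ICC = %.2f -> SEP = %.2f, 95%% CI half-width = %.2f\n",
              icc, sep, 1.96 * sep))
}
```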

What you are interested in is not necessarily whether individuals change at all between the two time points though. You want to know if this change is systematic in one direction.

Here I calculate the proportion of people showing ‘meaningful’ changes in each direction. To integrate the two directions, I subtract one proportion from the other and keep the variables for which more than 5% of participants show meaningful change in one direction over the other (so a variable with 10 participants changing in each direction would cancel out, but a variable with 15 positive and 5 negative changes would remain).

get_ind_ci = function(dv_var){
  matched = match_t1_t2(dv_var)
  grand_mean = mean(matched$score)
  grand_sd = sd(matched$score)
  dv_icc = get_icc(dv_var)
  sep = grand_sd * sqrt(1-(dv_icc^2))
  matched = matched %>% 
    spread(time, score) %>%
    rename("t1"="1", "t2"="2") %>%
    mutate(true_score = grand_mean+(dv_icc*(t1-grand_mean)),
           ci_up = true_score+(1.96*sep),
           ci_low = true_score-(1.96*sep),
           t2_above_ci = ifelse(t2>ci_up,1,0),
           t2_below_ci = ifelse(t2<ci_low,1, 0))
    return(matched)
}

get_prop_out_ci = function(dv_var){
  get_ind_ci(dv_var) %>%
    summarise(prop_above_ci = sum(t2_above_ci)/n(),
              prop_below_ci = sum(t2_below_ci)/n())
}

#Create df of point estimate reliabilities
ind_ci_df <- data.frame(prop_above_ci = rep(NA, length(numeric_cols)),
                        prop_below_ci = rep(NA, length(numeric_cols)))

row.names(ind_ci_df) <- numeric_cols

for(i in 1:length(numeric_cols)){
    ind_ci_df[numeric_cols[i], 'prop_above_ci'] <- get_prop_out_ci(numeric_cols[i])$prop_above_ci 
    ind_ci_df[numeric_cols[i], 'prop_below_ci'] <- get_prop_out_ci(numeric_cols[i])$prop_below_ci 
}

ind_ci_df$dv = row.names(ind_ci_df)
row.names(ind_ci_df) = 1:nrow(ind_ci_df)
ind_ci_df$task = 'task'
ind_ci_df[grep('survey', ind_ci_df$dv), 'task'] = 'survey'
ind_ci_df[grep('holt', ind_ci_df$dv), 'task'] = "task"
ind_ci_df = ind_ci_df %>%
  select(dv, task, prop_above_ci, prop_below_ci)

mean_diff_df = ind_ci_df %>%
  mutate(prop_one_direction = prop_above_ci-prop_below_ci) %>%
  filter(abs(prop_one_direction)>0.05) %>%
  arrange(-abs(prop_one_direction))

mean_diff_df
# write.csv(mean_diff_df, '/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/tables/mean_diff_df.csv')

This doesn’t by itself tell us whether the change is a performance improvement or decline; in fact, ‘performance’ isn’t even necessarily the correct term here. Using the valence assignments of the measures, we can instead ask whether the change translates to an increase or decrease in self-regulation. Of the 66 variables this makes sense for 54 (valence is unavailable for the remaining 12).

mean_diff_df = read.csv('/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/tables/mean_diff_df.csv')

table(as.character(mean_diff_df$valence), useNA = "ifany")

 Neg  Pos <NA> 
  24   30   12 
mean_diff_df %>%
  mutate(sc_increase = ifelse(prop_one_direction>0&valence=="Pos",1,ifelse(prop_one_direction<0&valence=="Neg",1,0)),
         sc_decrease = ifelse(prop_one_direction>0&valence=="Neg",1,ifelse(prop_one_direction<0&valence=="Pos",1,0))) %>%
summarise(sum_sc_increase = sum(sc_increase,na.rm=T),
          sum_sc_decrease = sum(sc_decrease,na.rm=T))

Whether variables show more than a 5% difference in one direction does not depend on whether the change translates to a self-control increase or decrease.

chisq.test(matrix(data = c(23,54-23,31,54-31), nrow=2))

    Pearson's Chi-squared test with Yates' continuity correction

data:  matrix(data = c(23, 54 - 23, 31, 54 - 31), nrow = 2)
X-squared = 1.8, df = 1, p-value = 0.2

Do they differ depending on whether they are task or survey variables? No, though the result is marginal (p = .05).

chisq.test(matrix(data = c(5,74-5,61,372-61), nrow=2))

    Pearson's Chi-squared test with Yates' continuity correction

data:  matrix(data = c(5, 74 - 5, 61, 372 - 61), nrow = 2)
X-squared = 3.8, df = 1, p-value = 0.05
rm(mean_diff_df)

Task Reliabilities

Here we summarize the results at the task level to make them more digestible and easier to connect with the literature.

We reduce the list of task measures to one per task by averaging only the raw measures computed from all trials of a task. We chose to reduce the information in this manner to avoid any bias stemming from the differential amount of interest in, and processing applied to, certain tasks over others (e.g. one task can have over 10 measures because it has multiple conditions for which we chose to fit DDMs, while another might have only 2 due to our relative inexperience with or lack of interest in it). We also check whether the number of trials in a task has a significant effect on these average reliabilities of raw measures.

We filter out the DDM parameters and measures for specific contrasts. Note that this still leaves some tasks with measures that are model fits and/or specific to certain conditions, because the current datasets do not include raw measures based on all trials for every task (though one could dive into variables_exhaustive for such measures; for example, the average reliability for Kirby is based on three discount rates for specific conditions). Here is the order of tasks by mean reliability, sorted by ICC and then Spearman’s \(\rho\).

tmp = measure_labels %>%
  mutate(dv = as.character(dv)) %>%
  filter(task == 'task') %>%
  left_join(boot_df[,c("dv", "icc", "spearman")], by = 'dv') %>%
  filter(overall_difference != 'difference' & raw_fit %in% c('EZ', 'hddm') == FALSE)%>%
  separate(dv, c('task_name', 'extra_1', 'extra_2'), sep = '\\.',remove=FALSE) %>%
  select(-extra_1, -extra_2) %>%
  group_by(task_name) %>%
  summarise(median_icc = median(icc),
            median_spearman = median(spearman),
            icc_2.5 = quantile(icc, probs = 0.025),
            icc_97.5 = quantile(icc, probs = 0.975),
            min_spearman = min(spearman),
            max_spearman = max(spearman),
            num_measures = n()/1000,
            mean_num_trials = round(mean(num_all_trials)))%>%
  arrange(-median_icc, -median_spearman)

tmp %>%
  datatable() %>%
  formatRound(columns=c('median_spearman', 'median_icc',
                        'min_spearman', 'icc_2.5',
                        'max_spearman', 'icc_97.5'), digits=3)
tmp = tmp%>%
  select(-median_spearman, -min_spearman, -max_spearman) %>%
  mutate(task_name = gsub("_", " ", task_name),
         task_name = gsub("(^|[[:space:]])([[:alpha:]])", "\\1\\U\\2", task_name, perl=TRUE))

names(tmp) = gsub("_", " ", names(tmp))
names(tmp) = gsub("(^|[[:space:]])([[:alpha:]])", "\\1\\U\\2", names(tmp), perl=TRUE)

sjt.df(tmp %>% mutate_if(is.numeric, funs(round(., 3))), describe=F, hide.progress = TRUE, show.rownames = FALSE, file = "/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/tables/task_rel_table.doc")
Task Name Median Icc Icc 2.5 Icc 97.5 Num Measures Mean Num Trials
Ravens 0.875 0.845 0.904 1 18
Adaptive N Back 0.842 0.362 0.911 3 380
Kirby 0.835 0.746 0.897 12 14
Cognitive Reflection Survey 0.783 0.669 0.865 2 6
Threebytwo 0.772 0.694 0.845 2 383
Recent Probes 0.772 0.629 0.847 2 68
Stroop 0.76 0.445 0.844 7 67
Simple Reaction Time 0.738 0.668 0.826 1 149
Psychological Refractory Period Two Choices 0.732 0.499 0.85 4 186
Choice Reaction Time 0.727 0.506 0.848 2 150
Hierarchical Rule 0.707 0.592 0.792 3 298
Attention Network Task 0.703 -0.513 0.814 6 75
Simon 0.692 0.398 0.821 8 61
Directed Forgetting 0.681 0.479 0.782 2 67
Digit Span 0.675 0.544 0.786 2 10
Shape Matching 0.673 0.57 0.763 2 272
Information Sampling Task 0.67 0.118 0.859 8 10
Stop Signal 0.661 0.261 0.815 11 304
Spatial Span 0.658 0.382 0.794 2 10
Discount Titrate 0.645 0.496 0.753 1 36
Go Nogo 0.643 0.256 0.785 5 210
Keep Track 0.641 0.497 0.719 1 9
Tower Of London 0.634 0.436 0.855 4 12
Dot Pattern Expectancy 0.599 0.111 0.798 11 61
Columbia Card Task Hot 0.597 0.433 0.857 5 24
Stim Selective Stop Signal 0.577 0.375 0.752 4 104
Two Stage Decision 0.575 0.294 0.746 4 198
Local Global Letter 0.57 0.285 0.76 12 31
Dietary Decision 0.564 0.088 0.749 3 48
Columbia Card Task Cold 0.524 0.248 0.688 5 24
Holt Laury Survey 0.524 -0.122 0.659 3 10
Shift Task 0.476 -0.056 0.771 13 401
Angling Risk Task Always Sunny 0.428 0.039 0.636 6 21
Motor Selective Stop Signal 0.423 -0.357 0.798 4 63
Probabilistic Selection 0.402 -0.194 0.752 7 54
Writing Task 0.4 0.207 0.538 2 1
Bickel Titrator 0.066 -0.033 0.783 4 30

Number of trials

Does the number of trials in a task have a significant effect on the average ICC of the (mostly) raw measures computed over all trials of a task? No. (There is no effect on the Spearman correlations either.)

tmp = measure_labels %>%
  mutate(dv = as.character(dv)) %>%
  filter(task == 'task') %>%
  # left_join(rel_df[,c("dv", "spearman","icc")], by='dv') %>%
  left_join(boot_df[,c("dv", "spearman","icc")], by='dv') %>%
  filter(overall_difference != 'difference' & raw_fit %in% c('EZ', 'hddm') == FALSE)%>%
  separate(dv, c('task_name', 'extra_1', 'extra_2'), sep = '\\.',remove=FALSE) %>%
  select(-extra_1, -extra_2)

# summary(lm(icc ~ num_all_trials, data = tmp))
summary(lmerTest::lmer(icc ~ num_all_trials + (1|dv), data = tmp))
Linear mixed model fit by REML t-tests use Satterthwaite approximations
  to degrees of freedom [lmerMod]
Formula: icc ~ num_all_trials + (1 | dv)
   Data: tmp

REML criterion at convergence: -408213

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-12.548  -0.460   0.037   0.516   7.553 

Random effects:
 Groups   Name        Variance Std.Dev.
 dv       (Intercept) 0.04556  0.2135  
 Residual             0.00555  0.0745  
Number of obs: 174000, groups:  dv, 174

Fixed effects:
               Estimate Std. Error       df t value Pr(>|t|)    
(Intercept)    5.81e-01   2.11e-02 1.72e+02   27.60   <2e-16 ***
num_all_trials 2.12e-05   1.19e-04 1.72e+02    0.18     0.86    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
            (Intr)
num_ll_trls -0.640
measure_labels %>%
  mutate(dv = as.character(dv)) %>%
  filter(task == 'task') %>%
  # left_join(rel_df[,c("dv", "spearman","icc")], by='dv') %>%
  left_join(boot_df[,c("dv", "spearman","icc")], by='dv') %>%
  filter(overall_difference != 'difference' & raw_fit %in% c('EZ', 'hddm') == FALSE)%>%
  separate(dv, c('task_name', 'extra_1', 'extra_2'), sep = '\\.',remove=FALSE) %>%
  select(-extra_1, -extra_2) %>%
  ggplot(aes(num_all_trials, icc))+
  geom_point()+
  geom_smooth(method="lm")+
  theme_bw()+
  xlab("Number of trials")+
  ylab("ICC")

Trial number dependence intrameasure

The above analysis looked at the effect of the number of trials across tasks. But some tasks might be poor for individual difference measurement regardless of how many trials they contain, whereas for others even fewer trials might yield a sufficiently reliable measure.

For tasks whose dependent variables are estimated using many trials one can ask: does the same measure get less reliable if fewer trials are used to estimate it?

This won’t make sense for all tasks. For example, estimating a risk aversion parameter for Holt and Laury requires all trials. For Kirby and Bickel, specific conditions already use subsets of trials. The Cognitive Reflection Task might be more appropriately analyzed item by item. The writing task does not have multiple trials. For all other tasks it might be interesting to investigate.

These kinds of analyses are too task-specific and in-depth for a paper that aims to give a global sense of how self-regulation measures differ in their suitability for individual difference analyses based on their stability across time. Such analyses would provide a detailed examination of how to extract the most reliable individual difference measure from tasks that begin with a set of mediocre variables. Though we do not provide a comprehensive analysis of this sort in this paper, we present a single example of the approach and hope that the open access we provide to the data spurs further work.
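As a sketch of the general recipe behind such an example, one could recompute a simple DV on increasingly large subsets of trials and track the test-retest correlation. The data frame `trials` and its columns (sub_id, time, trial_num, rt, correct) are hypothetical names, not the actual cluster code:

```r
library(dplyr)
library(tidyr)

# For a given number of trials, recompute a simple DV (mean RT on correct
# trials) per subject and session, then correlate the two sessions.
subsample_reliability = function(trials, n_trials) {
  trials %>%
    filter(trial_num <= n_trials) %>%
    group_by(sub_id, time) %>%
    summarise(dv = mean(rt[correct == 1])) %>%
    spread(time, dv) %>%
    rename("t1" = "1", "t2" = "2") %>%
    ungroup() %>%
    summarise(n_trials = n_trials,
              r = cor(t1, t2, use = "complete.obs"))
}

# e.g. bind_rows(lapply(seq(60, 360, by = 60),
#                       function(n) subsample_reliability(trials, n)))
```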

For this example we look at the retest reliability of the three dependent measures (average accuracy, median response time, total score) from the hierarchical rule task with 360 trials. Here is a graph of how the point estimates of retest reliability change for each dependent measure when different numbers of trials are used to estimate them. While the reliability estimates for the variables are …, … and … using all of the trials, these values are independent of the number of trials used only for certain variables. Based on this analysis, researchers might decide to use a version of the task with fewer trials, or only the measures with consistently high reliability estimates.

Example task: three by two

Post process dv’s from cluster to calculate reliabilities

t1_dvs = read.csv(paste0(retest_data_path, 't1_tbt_dvs.csv'))
t2_dvs = read.csv(paste0(retest_data_path, 't2_tbt_dvs.csv'))

t1_dvs = t1_dvs %>% select(-X)
t2_dvs = t2_dvs %>% select(-X)
hr_merge = merge(t1_dvs, t2_dvs, by = c("sub_id", "breaks"))

hr_merge = hr_merge %>%
  gather(key, value, -sub_id, -breaks) %>%
  separate(key, c("dv", "time"), sep="\\.") %>%
  mutate(time = ifelse(time == "x", 1, 2))

t1_dvs = hr_merge %>%
  filter(time == 1) %>%
  select(-time) %>%
  spread(dv, value)

t2_dvs = hr_merge %>%
  filter(time == 2) %>%
  select(-time) %>%
  spread(dv, value)

# calculate point estimates for reliability of each of the variables for each break
# get_icc for each break of tmp_t1_dvs and tmp_t2_dvs

icc_cols = c("acc_icc", "avg_rt_error_icc", "std_rt_error_icc", "avg_rt_icc",
             "std_rt_icc", "missed_percent_icc",
             "cue_switch_cost_rt_100_icc", "cue_switch_cost_rt_900_icc",
             "task_switch_cost_rt_100_icc", "task_switch_cost_rt_900_icc",
             "cue_switch_cost_acc_100_icc", "cue_switch_cost_acc_900_icc",
             "task_switch_cost_acc_100_icc", "task_switch_cost_acc_900_icc")

# One row per break, all columns initialized to NA
trial_num_rel_df = as.data.frame(matrix(NA, nrow = length(unique(t1_dvs$breaks)),
                                        ncol = length(icc_cols) + 1))
names(trial_num_rel_df) = c("breaks", icc_cols)

for(i in 1:length(unique(t1_dvs$breaks))){
  cur_break = unique(t1_dvs$breaks)[i]
  tmp_t1_dvs = t1_dvs %>% filter(breaks == cur_break)
  tmp_t2_dvs = t2_dvs %>% filter(breaks == cur_break)
  trial_num_rel_df$breaks[i] = cur_break
  trial_num_rel_df$acc_icc[i] = get_icc("acc", tmp_t1_dvs, tmp_t2_dvs)
  trial_num_rel_df$avg_rt_error_icc[i] = get_icc("avg_rt_error", tmp_t1_dvs, tmp_t2_dvs)
  trial_num_rel_df$std_rt_error_icc[i] = get_icc("std_rt_error", tmp_t1_dvs, tmp_t2_dvs)
  trial_num_rel_df$avg_rt_icc[i] = get_icc("avg_rt", tmp_t1_dvs, tmp_t2_dvs)
  trial_num_rel_df$std_rt_icc[i] = get_icc("std_rt", tmp_t1_dvs, tmp_t2_dvs)
  trial_num_rel_df$missed_percent_icc[i] = get_icc("missed_percent", tmp_t1_dvs, tmp_t2_dvs)
  trial_num_rel_df$cue_switch_cost_rt_100_icc[i] = get_icc("cue_switch_cost_rt_100", tmp_t1_dvs, tmp_t2_dvs)
  trial_num_rel_df$cue_switch_cost_rt_900_icc[i] = get_icc("cue_switch_cost_rt_900", tmp_t1_dvs, tmp_t2_dvs)
  trial_num_rel_df$task_switch_cost_rt_100_icc[i] = get_icc("task_switch_cost_rt_100", tmp_t1_dvs, tmp_t2_dvs)
  trial_num_rel_df$task_switch_cost_rt_900_icc[i] = get_icc("task_switch_cost_rt_900", tmp_t1_dvs, tmp_t2_dvs)
  trial_num_rel_df$cue_switch_cost_acc_100_icc[i] = get_icc("cue_switch_cost_acc_100", tmp_t1_dvs, tmp_t2_dvs)
  trial_num_rel_df$cue_switch_cost_acc_900_icc[i] = get_icc("cue_switch_cost_acc_900", tmp_t1_dvs, tmp_t2_dvs)
  trial_num_rel_df$task_switch_cost_acc_100_icc[i] = get_icc("task_switch_cost_acc_100", tmp_t1_dvs, tmp_t2_dvs)
  trial_num_rel_df$task_switch_cost_acc_900_icc[i] = get_icc("task_switch_cost_acc_900", tmp_t1_dvs, tmp_t2_dvs)
}
rm(i, cur_break, tmp_t1_dvs, tmp_t2_dvs)

trial_num_rel_df$breaks = as.numeric(trial_num_rel_df$breaks)

write.csv(trial_num_rel_df, paste0(retest_data_path, 'trial_num_rel_df_tbt.csv'))
# trial_num_rel_df = read.csv(paste0(retest_data_path, 'trial_num_rel_df_tbt.csv'))

cols <- c("Accuracy" = '#084594', "Average RT error" = '#99000d', "SD RT error" = '#cb181d', "Average RT correct" = '#ef3b2c', "SD RT correct" = '#fb6a4a', "Missed percentage" = '#2171b5', "Cue switch cost RT (100)" = '#fc9272', "Cue switch cost RT (900)" = '#fcbba1', "Task switch cost RT (100)" = '#fee0d2', "Task switch cost RT (900)" = '#fff5f0', "Cue switch cost Acc (100)" = '#4292c6', "Cue switch cost Acc (900)" = '#6baed6', "Task switch cost Acc (100)" = '#9ecae1', "Task switch cost Acc (900)" = '#c6dbef')

trial_num_rel_df %>%
  gather(key, value, -breaks) %>%
  ggplot(aes(breaks*10, value, color=factor(key, levels = c("acc_icc", "avg_rt_error_icc", "std_rt_error_icc", "avg_rt_icc", "std_rt_icc", "missed_percent_icc", "cue_switch_cost_rt_100_icc", "cue_switch_cost_rt_900_icc", "task_switch_cost_rt_100_icc", "task_switch_cost_rt_900_icc", "cue_switch_cost_acc_100_icc", "cue_switch_cost_acc_900_icc", "task_switch_cost_acc_100_icc", "task_switch_cost_acc_900_icc"), labels = c("Accuracy", "Average RT error", "SD RT error", "Average RT correct", "SD RT correct", "Missed percentage", "Cue switch cost RT (100)", "Cue switch cost RT (900)", "Task switch cost RT (100)", "Task switch cost RT (900)", "Cue switch cost Acc (100)", "Cue switch cost Acc (900)", "Task switch cost Acc (100)", "Task switch cost Acc (900)"))))+
  geom_point()+
  geom_line()+
  theme_bw()+
  xlab("Number of trials")+
  ylab("ICC")+
  theme(legend.title = element_blank(),
        legend.position = 'bottom')+
  scale_color_manual(values = cols,
                     breaks = c("Average RT error", "Accuracy", "SD RT error", "Missed percentage", "Average RT correct", "Cue switch cost Acc (100)", "SD RT correct", "Cue switch cost Acc (900)", "Cue switch cost RT (100)", "Task switch cost Acc (100)", "Task switch cost RT (100)", "Task switch cost Acc (900)", "Task switch cost RT (900)"))+
  ylim(-1,1)+
  guides(color = guide_legend(ncol=2, byrow=T))

ggsave('Intrameasure_Trialnum_Dependendence.jpg', device = "jpeg", path = "/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/figures", width = 10, height = 5, units = "in", dpi = 100)

Raw vs DDM

Checking DDM results in the bootstrapped estimates. Measures using all trials are significantly more reliable than difference scores. Raw measures do not differ from DDM parameters. Which DDM fitting approach is better depends on whether all trials are used.

tmp = measure_labels %>%
  mutate(dv = as.character(dv)) %>%
  filter(ddm_task == 1, 
         overall_difference != "condition",
         rt_acc != 'other') %>%
  drop_na() %>%
  left_join(boot_df[,c("dv", "icc", "spearman")], by = 'dv')

tmp %>%
  drop_na() %>% #try removing this in final release
  group_by(overall_difference, raw_fit, rt_acc) %>%
  summarise(icc_median = quantile(icc, probs = 0.5),
            icc_2.5 = quantile(icc, probs = 0.025),
            icc_97.5 = quantile(icc, probs = 0.975),
            spearman_median = quantile(spearman, probs = 0.5),
            spearman_2.5 = quantile(spearman, probs = 0.025),
            spearman_97.5 = quantile(spearman, probs = 0.975),
            num_vars = n()/1000) %>%
  datatable() %>%
  formatRound(columns=c('icc_median', 'icc_2.5', 'icc_97.5', 'spearman_median', 'spearman_2.5', 'spearman_97.5'), digits=3)
tmp_save = tmp %>%
  drop_na() %>% #try removing this in final release
  group_by(overall_difference, raw_fit, rt_acc) %>%
  summarise(icc_median = quantile(icc, probs = 0.5),
            icc_2.5 = quantile(icc, probs = 0.025),
            icc_97.5 = quantile(icc, probs = 0.975),
            num_vars = n()/1000) %>%
  ungroup() %>%
  mutate(overall_difference = as.character(overall_difference),
         raw_fit = as.character(raw_fit),
         rt_acc = as.character(rt_acc)) %>%
  arrange(-icc_median)

sjt.df(tmp_save %>% mutate_if(is.numeric, funs(round(., 3))), describe=F, hide.progress = TRUE, show.rownames = FALSE, file = "/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/tables/ddm_rel_table.doc")
overall_difference raw_fit rt_acc icc_median icc_2.5 icc_97.5 num_vars
overall raw rt 0.736 0.535 0.854 16
overall EZ drift rate 0.724 0.528 0.826 14
overall raw accuracy 0.684 -0.375 0.832 14
overall hddm drift rate 0.684 0.387 0.816 16
overall EZ non-decision 0.647 0.197 0.794 14
overall EZ threshold 0.624 0.384 0.823 14
overall hddm threshold 0.544 0.316 0.668 14
overall hddm non-decision 0.449 -0.032 0.671 14
difference hddm drift rate 0.404 -0.242 0.651 18
difference raw rt 0.299 -0.193 0.63 25
difference raw accuracy 0.27 -0.245 0.698 13
difference EZ drift rate 0.248 -0.196 0.579 17
difference EZ non-decision 0.236 -0.204 0.652 17
difference EZ threshold 0.097 -0.449 0.537 17
difference hddm threshold 0.09 -0.201 0.345 1

Comparing overall vs difference: overall has higher reliability than difference.

summary(lmerTest::lmer(icc ~ overall_difference + (1|dv) ,tmp))
Linear mixed model fit by REML t-tests use Satterthwaite approximations
  to degrees of freedom [lmerMod]
Formula: icc ~ overall_difference + (1 | dv)
   Data: tmp

REML criterion at convergence: -401676

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-11.245  -0.467   0.041   0.521   6.005 

Random effects:
 Groups   Name        Variance Std.Dev.
 dv       (Intercept) 0.03486  0.1867  
 Residual             0.00966  0.0983  
Number of obs: 224000, groups:  dv, 224

Fixed effects:
                          Estimate Std. Error      df t value Pr(>|t|)    
(Intercept)                  0.245      0.018 221.400    13.6   <2e-16 ***
overall_differenceoverall    0.368      0.025 221.400    14.8   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
            (Intr)
ovrll_dffrn -0.720

This is consistent with the summing of error variance in the difference score.
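The arithmetic behind this can be checked with a quick simulation (illustrative toy data, not the study measures):

```r
# Error variances add in a difference score while correlated true scores
# partly cancel, so the difference is less reliable than its components.
set.seed(1)
n = 500
true_a = rnorm(n)
true_b = 0.7 * true_a + sqrt(1 - 0.7^2) * rnorm(n)  # correlated conditions
noisy = function(x) x + rnorm(n, sd = 1)            # add measurement error
a1 = noisy(true_a); a2 = noisy(true_a)  # condition A, sessions 1 and 2
b1 = noisy(true_b); b2 = noisy(true_b)  # condition B, sessions 1 and 2
c(rel_a = cor(a1, a2),                  # component retest reliability
  rel_diff = cor(a1 - b1, a2 - b2))     # difference score: noticeably lower
```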

tmp = measure_labels %>%
  mutate(dv = as.character(dv)) %>%
  filter(ddm_task == 1, 
         overall_difference != "condition",
         rt_acc != 'other') %>%
  drop_na() %>%
  left_join(boot_df[,c("dv", "icc", "var_subs","var_ind", "var_resid")], by = 'dv') %>%
  mutate(var_subs_pct = var_subs/(var_subs+var_ind+var_resid),
         var_ind_pct = var_ind/(var_subs+var_ind+var_resid), 
         var_resid_pct = var_resid/(var_subs+var_ind+var_resid)) %>%
  select(-task, -ddm_task, -num_all_trials, -var_resid, -var_subs, -var_ind) %>%
  gather(key, value, -dv, -overall_difference, -raw_fit, -rt_acc, -icc)

tmp %>%
  filter(key == "var_resid_pct") %>%
  ggplot(aes(raw_fit, value, fill=overall_difference))+
  geom_boxplot()

summary(lmer(value  ~ overall_difference*raw_fit + (1|dv), data=tmp %>% filter(key=="var_resid_pct")))
Linear mixed model fit by REML ['lmerMod']
Formula: value ~ overall_difference * raw_fit + (1 | dv)
   Data: tmp %>% filter(key == "var_resid_pct")

REML criterion at convergence: -608085

Scaled residuals: 
   Min     1Q Median     3Q    Max 
-5.529 -0.543  0.020  0.601  5.547 

Random effects:
 Groups   Name        Variance Std.Dev.
 dv       (Intercept) 0.00539  0.0734  
 Residual             0.00385  0.0620  
Number of obs: 224000, groups:  dv, 224

Fixed effects:
                                      Estimate Std. Error t value
(Intercept)                             0.3437     0.0103    33.4
overall_differenceoverall              -0.1594     0.0153   -10.4
raw_fithddm                            -0.1070     0.0197    -5.4
raw_fitraw                             -0.0320     0.0157    -2.0
overall_differenceoverall:raw_fithddm   0.1270     0.0253     5.0
overall_differenceoverall:raw_fitraw    0.0240     0.0236     1.0

Correlation of Fixed Effects:
                        (Intr) ovrll_ rw_fth rw_ftr
ovrll_dffrn             -0.672                     
raw_fithddm             -0.521  0.350              
raw_fitraw              -0.653  0.439  0.340       
ovrll_dffrncvrll:rw_fth  0.406 -0.605 -0.780 -0.265
ovrll_dffrncvrll:rw_ftr  0.436 -0.649 -0.227 -0.668
                        ovrll_dffrncvrll:rw_fth
ovrll_dffrn                                    
raw_fithddm                                    
raw_fitraw                                     
ovrll_dffrncvrll:rw_fth                        
ovrll_dffrncvrll:rw_ftr  0.392                 
#Simple effects
summary(lmer(value  ~ overall_difference + (1|dv), data=tmp %>% filter(key=="var_resid_pct", raw_fit== "EZ")))
Linear mixed model fit by REML ['lmerMod']
Formula: value ~ overall_difference + (1 | dv)
   Data: tmp %>% filter(key == "var_resid_pct", raw_fit == "EZ")

REML criterion at convergence: -241956

Scaled residuals: 
   Min     1Q Median     3Q    Max 
-5.224 -0.536  0.041  0.607  4.544 

Random effects:
 Groups   Name        Variance Std.Dev.
 dv       (Intercept) 0.00448  0.0669  
 Residual             0.00431  0.0657  
Number of obs: 93000, groups:  dv, 93

Fixed effects:
                          Estimate Std. Error t value
(Intercept)                0.34371    0.00938    36.7
overall_differenceoverall -0.15937    0.01395   -11.4

Correlation of Fixed Effects:
            (Intr)
ovrll_dffrn -0.672
summary(lmer(value  ~ overall_difference + (1|dv), data=tmp %>% filter(key=="var_resid_pct", raw_fit== "raw")))
Linear mixed model fit by REML ['lmerMod']
Formula: value ~ overall_difference + (1 | dv)
   Data: tmp %>% filter(key == "var_resid_pct", raw_fit == "raw")

REML criterion at convergence: -184635

Scaled residuals: 
   Min     1Q Median     3Q    Max 
-4.952 -0.522  0.021  0.590  4.769 

Random effects:
 Groups   Name        Variance Std.Dev.
 dv       (Intercept) 0.00525  0.0724  
 Residual             0.00385  0.0620  
Number of obs: 68000, groups:  dv, 68

Fixed effects:
                          Estimate Std. Error t value
(Intercept)                 0.3117     0.0118   26.51
overall_differenceoverall  -0.1354     0.0177   -7.65

Correlation of Fixed Effects:
            (Intr)
ovrll_dffrn -0.664
summary(lmer(value  ~ overall_difference + (1|dv), data=tmp %>% filter(key=="var_resid_pct", raw_fit== "hddm")))
Linear mixed model fit by REML ['lmerMod']
Formula: value ~ overall_difference + (1 | dv)
   Data: tmp %>% filter(key == "var_resid_pct", raw_fit == "hddm")

REML criterion at convergence: -183232

Scaled residuals: 
   Min     1Q Median     3Q    Max 
-5.198 -0.592 -0.008  0.605  6.113 

Random effects:
 Groups   Name        Variance Std.Dev.
 dv       (Intercept) 0.00689  0.0830  
 Residual             0.00317  0.0563  
Number of obs: 63000, groups:  dv, 63

Fixed effects:
                          Estimate Std. Error t value
(Intercept)                 0.2368     0.0190   12.43
overall_differenceoverall  -0.0324     0.0228   -1.42

Correlation of Fixed Effects:
            (Intr)
ovrll_dffrn -0.836

Checking Rogosa’s claim that “difference scores are reliable when individual differences in true change exist.” How do we measure “individual differences in true change”? Two options: (1) between-subject variability in the difference score distributions, or (2) within-subject variance from the ICC decomposition. We use option 2, because comparable between-subject variances across different measures would be hard to obtain. Expecting to see: a positive correlation between mean ICC and mean var_ind_pct. Result: the correlation between mean ICC and mean var_ind_pct is NOT significant, but the distribution of mean var_ind_pct also shows little variability.

tmp = tmp %>%
  spread(key, value) %>%
  group_by(dv) %>%
  summarise(mean_var_ind_pct = mean(var_ind_pct),
            mean_var_subs_pct = mean(var_subs_pct),
            mean_icc = mean(icc)) %>%
  left_join(measure_labels, by='dv')
Warning: Column `dv` joining character vector and factor, coercing into
character vector
tmp %>%
  filter(overall_difference == 'difference') %>%
  ggplot(aes(mean_var_ind_pct, mean_icc))+
  geom_point()+
  geom_abline(slope=1, intercept=0)+
  geom_smooth(method = "lm")

summary(lm(mean_icc ~ mean_var_ind_pct,data=tmp %>%
  filter(overall_difference == 'difference')))

Call:
lm(formula = mean_icc ~ mean_var_ind_pct, data = tmp %>% filter(overall_difference == 
    "difference"))

Residuals:
    Min      1Q  Median      3Q     Max 
-0.6262 -0.1294 -0.0076  0.1613  0.3764 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)        0.2302     0.0366    6.29  7.1e-09 ***
mean_var_ind_pct   0.0587     0.1197    0.49     0.62    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.212 on 106 degrees of freedom
Multiple R-squared:  0.00227,   Adjusted R-squared:  -0.00715 
F-statistic: 0.241 on 1 and 106 DF,  p-value: 0.625
tmp %>%
  filter(overall_difference == 'difference') %>%
  ggplot(aes(mean_var_ind_pct))+
  geom_density()

summary(tmp$mean_var_ind_pct[tmp$overall_difference == "difference"])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0917  0.1350  0.1820  0.2540  0.3350  0.9560 
summary(tmp$mean_var_subs_pct[tmp$overall_difference == "difference"])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0255  0.3690  0.4440  0.4330  0.5170  0.6300 
tmp = tmp %>% filter(overall_difference == 'difference')
t.test(tmp$mean_var_ind_pct, tmp$mean_var_subs_pct, paired=T)

    Paired t-test

data:  tmp$mean_var_ind_pct and tmp$mean_var_subs_pct
t = -6.7, df = 110, p-value = 8e-10
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.2318 -0.1265
sample estimates:
mean of the differences 
                -0.1791 
tmp = measure_labels %>%
  mutate(dv = as.character(dv)) %>%
  filter(ddm_task == 1, 
         overall_difference != "condition",
         rt_acc != 'other') %>%
  drop_na() %>%
  left_join(boot_df[,c("dv", "icc", "spearman")], by = 'dv')

Comparing raw vs. DDM for overall estimates: EZ is significantly better than HDDM and comparable to raw estimates.

summary(lmerTest::lmer(icc ~ raw_fit + (1|dv) ,tmp %>% filter(overall_difference == "overall")))
Linear mixed model fit by REML t-tests use Satterthwaite approximations
  to degrees of freedom [lmerMod]
Formula: icc ~ raw_fit + (1 | dv)
   Data: tmp %>% filter(overall_difference == "overall")

REML criterion at convergence: -278565

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-11.366  -0.485   0.031   0.510   8.141 

Random effects:
 Groups   Name        Variance Std.Dev.
 dv       (Intercept) 0.02299  0.1516  
 Residual             0.00526  0.0725  
Number of obs: 116000, groups:  dv, 116

Fixed effects:
            Estimate Std. Error       df t value Pr(>|t|)    
(Intercept)   0.6493     0.0234 113.4000   27.75   <2e-16 ***
raw_fithddm  -0.1068     0.0327 113.4000   -3.27   0.0014 ** 
raw_fitraw    0.0186     0.0362 113.4000    0.51   0.6080    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
            (Intr) rw_fth
raw_fithddm -0.715       
raw_fitraw  -0.645  0.462

Comparing raw vs. DDM for difference scores: EZ is significantly worse than HDDM and comparable to raw estimates.

summary(lmerTest::lmer(icc ~ raw_fit + (1|dv) ,tmp %>% filter(overall_difference == "difference")))
Linear mixed model fit by REML t-tests use Satterthwaite approximations
  to degrees of freedom [lmerMod]
Formula: icc ~ raw_fit + (1 | dv)
   Data: tmp %>% filter(overall_difference == "difference")

REML criterion at convergence: -150642

Scaled residuals: 
   Min     1Q Median     3Q    Max 
-9.214 -0.535  0.065  0.615  4.172 

Random effects:
 Groups   Name        Variance Std.Dev.
 dv       (Intercept) 0.0423   0.206   
 Residual             0.0144   0.120   
Number of obs: 108000, groups:  dv, 108

Fixed effects:
            Estimate Std. Error       df t value Pr(>|t|)    
(Intercept)   0.1900     0.0288 105.1000    6.59  1.8e-09 ***
raw_fithddm   0.1425     0.0553 105.1000    2.58    0.011 *  
raw_fitraw    0.0855     0.0441 105.1000    1.94    0.055 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
            (Intr) rw_fth
raw_fithddm -0.521       
raw_fitraw  -0.653  0.340
tmp %>%
  ggplot(aes(factor(raw_fit, levels = c("raw", "EZ", "hddm"), labels=c("Raw", "EZ-diffusion", "Hierarchical diffusion")), icc, fill=factor(rt_acc, levels = c("rt","accuracy", "drift rate", "threshold", "non-decision"), labels=c("Response Time", "Accuracy","Drift Rate", "Threshold", "Non-decision"))))+
  geom_boxplot()+
  facet_wrap(~factor(overall_difference, levels=c("overall", "difference"), labels=c("Overall", "Difference")))+
  theme_bw()+
  ylab("ICC")+
  xlab("")+
  theme(legend.title = element_blank(),
        legend.position = 'bottom',
        legend.text = element_text(size=16),
        strip.text = element_text(size=16),
        axis.text = element_text(size = 16),
        text = element_text(size=16))+
  guides(fill = guide_legend(ncol = 2))+
  scale_fill_discrete(breaks=c("Drift Rate", "Threshold", "Non-decision", "Response Time", "Accuracy"))

ggsave('Bootstrap_DDM_Comp.jpg', device = "jpeg", path = "/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/figures/", width = 14, height = 10, units = "in", limitsize = FALSE)
tmp2 = measure_labels %>%
  mutate(dv = as.character(dv)) %>%
  filter(ddm_task == 1, 
         overall_difference != "condition",
         rt_acc != "other") %>%
  drop_na() %>%
  left_join(rel_df[,c("dv", "icc", "spearman")], by = 'dv')

tmp2 %>%
  ggplot(aes(factor(raw_fit, levels = c("raw", "EZ", "hddm"), labels=c("Raw", "EZ-diffusion", "Hierarchical diffusion")), icc, fill=factor(rt_acc, levels = c("rt","accuracy", "drift rate", "threshold", "non-decision"), labels=c("Response Time", "Accuracy","Drift Rate", "Threshold", "Non-decision"))))+
  geom_boxplot()+
  facet_wrap(~factor(overall_difference, levels=c("overall", "difference"), labels=c("Overall", "Difference")))+
  theme_bw()+
  ylab("ICC")+
  xlab("")+
  theme(legend.title = element_blank(),
        legend.position = 'bottom',
        legend.text = element_text(size=16),
        strip.text = element_text(size=16),
        axis.text = element_text(size = 16),
        text = element_text(size=16))+
  guides(fill = guide_legend(ncol = 2))+
  scale_fill_discrete(breaks=c("Drift Rate", "Threshold", "Non-decision", "Response Time", "Accuracy"))

ggsave('PointEst_DDM_Comp.jpg', device = "jpeg", path = "/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/figures/", width = 14, height = 10, units = "in", limitsize = FALSE)

Survey reliabilities

tmp = measure_labels %>%
  mutate(dv = as.character(dv)) %>%
  filter(task == 'survey') %>%
  left_join(boot_df[,c("dv", "icc", "spearman")], by = 'dv') %>%
  filter(overall_difference != 'difference' & raw_fit %in% c('EZ', 'hddm') == FALSE)%>%
  separate(dv, c('task_name', 'extra_1', 'extra_2'), sep = '\\.',remove=FALSE) %>%
  select(-extra_1, -extra_2) %>%
  group_by(task_name) %>%
  summarise(median_icc = median(icc),
            median_spearman = median(spearman),
            icc_2.5 = quantile(icc, probs = 0.025),
            icc_97.5 = quantile(icc, probs = 0.975),
            min_spearman = min(spearman),
            max_spearman = max(spearman),
            num_measures = n()/1000,
            mean_num_trials = round(mean(num_all_trials)))%>%
  arrange(-median_icc, -median_spearman)

tmp %>%
  datatable() %>%
  formatRound(columns=c('median_spearman', 'median_icc',
                        'min_spearman', 'icc_2.5',
                        'max_spearman', 'icc_97.5'), digits=3)
tmp = tmp%>%
  select(-min_spearman, -max_spearman,-median_spearman) %>%
  mutate(task_name = gsub("_", " ", task_name),
         task_name = gsub("(^|[[:space:]])([[:alpha:]])", "\\1\\U\\2", task_name, perl=TRUE),
         task_name = gsub("Survey", "", task_name))

names(tmp) = gsub("_", " ", names(tmp))
names(tmp) = gsub("(^|[[:space:]])([[:alpha:]])", "\\1\\U\\2", names(tmp), perl=TRUE)

sjt.df(tmp %>% mutate_if(is.numeric, funs(round(., 3))), describe=F, hide.progress = TRUE, show.rownames = FALSE, file = "/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/tables/survey_rel_table.doc")
| Task Name | Median ICC | ICC 2.5% | ICC 97.5% | Num Measures | Mean Num Trials |
|---|---|---|---|---|---|
| Self Regulation | 0.95 | 0.936 | 0.961 | 1 | 31 |
| Grit Scale | 0.942 | 0.924 | 0.958 | 1 | 8 |
| Brief Self Control | 0.941 | 0.92 | 0.966 | 1 | 13 |
| Ten Item Personality | 0.936 | 0.852 | 0.966 | 5 | 2 |
| Time Perspective | 0.92 | 0.865 | 0.954 | 5 | 11 |
| Five Facet Mindfulness | 0.909 | 0.828 | 0.965 | 6 | 13 |
| Bis Bas | 0.904 | 0.851 | 0.958 | 5 | 7 |
| Dospert Rt | 0.903 | 0.855 | 0.953 | 5 | 6 |
| Impulsive Venture | 0.9 | 0.815 | 0.95 | 2 | 17 |
| Sensation Seeking | 0.899 | 0.852 | 0.951 | 5 | 16 |
| Mindful Attention Awareness | 0.897 | 0.862 | 0.928 | 1 | 15 |
| Erq | 0.896 | 0.827 | 0.933 | 2 | 5 |
| Upps Impulsivity | 0.886 | 0.718 | 0.959 | 5 | 12 |
| Mpq Control | 0.879 | 0.796 | 0.95 | 1 | 24 |
| Eating | 0.872 | 0.813 | 0.914 | 4 | 9 |
| Bis11 | 0.864 | 0.63 | 0.939 | 4 | 15 |
| Dickman | 0.85 | 0.721 | 0.91 | 2 | 12 |
| Future Time Perspective | 0.844 | 0.792 | 0.903 | 1 | 10 |
| Dospert Rp | 0.842 | 0.761 | 0.901 | 5 | 6 |
| Dospert Eb | 0.8 | 0.711 | 0.868 | 5 | 6 |
| Selection Optimization Compensation | 0.799 | 0.598 | 0.898 | 4 | 12 |
| Leisure Time Activity | 0.798 | 0.731 | 0.857 | 1 | 1 |
| Theories Of Willpower | 0.794 | 0.725 | 0.86 | 1 | 12 |

Number of items

Does the number of items in a survey have a significant effect on the average ICC of survey measures? No.

tmp = measure_labels %>%
  mutate(dv = as.character(dv)) %>%
  filter(task == 'survey') %>%
  # left_join(rel_df[,c("dv", "spearman","icc")], by='dv') %>%
  left_join(boot_df[,c("dv", "spearman","icc")], by='dv') %>%
  filter(overall_difference != 'difference' & raw_fit %in% c('EZ', 'hddm') == FALSE)%>%
  separate(dv, c('task_name', 'extra_1', 'extra_2'), sep = '\\.',remove=FALSE) %>%
  select(-extra_1, -extra_2)

# summary(lm(icc ~ num_all_trials, data = tmp))
summary(lmerTest::lmer(icc ~ num_all_trials + (1|dv), data = tmp))
Linear mixed model fit by REML t-tests use Satterthwaite approximations
  to degrees of freedom [lmerMod]
Formula: icc ~ num_all_trials + (1 | dv)
   Data: tmp

REML criterion at convergence: -320933

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-10.265  -0.516  -0.004   0.540   6.540 

Random effects:
 Groups   Name        Variance Std.Dev.
 dv       (Intercept) 0.003327 0.0577  
 Residual             0.000673 0.0259  
Number of obs: 72000, groups:  dv, 72

Fixed effects:
                Estimate Std. Error        df t value Pr(>|t|)    
(Intercept)     0.864206   0.011510 70.400000   75.08   <2e-16 ***
num_all_trials  0.001071   0.000915 70.400000    1.17     0.25    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
            (Intr)
num_ll_trls -0.807
measure_labels %>%
  mutate(dv = as.character(dv)) %>%
  filter(task == 'task') %>%
  # left_join(rel_df[,c("dv", "spearman","icc")], by='dv') %>%
  left_join(boot_df[,c("dv", "spearman","icc")], by='dv') %>%
  filter(overall_difference != 'difference' & raw_fit %in% c('EZ', 'hddm') == FALSE)%>%
  separate(dv, c('task_name', 'extra_1', 'extra_2'), sep = '\\.',remove=FALSE) %>%
  select(-extra_1, -extra_2) %>%
  ggplot(aes(num_all_trials, icc))+
  geom_point()+
  geom_smooth(method="lm")+
  theme_bw()+
  xlab("Number of trials")+
  ylab("ICC")

---
title: 'Self Regulation Ontology Retest Analyses'
output:
  github_document:
    toc: yes
    toc_float: yes
---

```{r, message=FALSE, warning=FALSE, include=FALSE}
source('/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/code/SRO_Retest_Analyses_Helper_Functions.R')
theme_set(theme_bw())

test_data_path = '/Users/zeynepenkavi/Documents/PoldrackLabLocal/Self_Regulation_Ontology/Data/Complete_03-29-2018/'

retest_data_path = '/Users/zeynepenkavi/Documents/PoldrackLabLocal/Self_Regulation_Ontology/Data/Retest_03-29-2018/'
```

# Introduction

The psychological construct of self-regulation, and related concepts such as impulsivity, self-control and inhibition, correlate with problematic real-world behaviors such as disordered eating (Mobbs et al., 2010; Verbeken et al., 2009; Nederkoorn et al., 2006a; Nederkoorn et al., 2006b), gambling (Lawrence et al., 2009; Fuentes et al., 2006; Alessi and Petry, 2003), drug addiction (Coffey et al., 2003; de Wit, Crean and John, 2000; Sher, Bartholow and Wood, 2000; Kirby, Petry and Bickel, 1999) and poor financial decisions (Meier and Sprenger, 2015; Meier and Sprenger, 2013; Meier and Sprenger, 2012). In these lines of work, measures of self-regulation are used as behavioral assays for individual difference analyses. An underlying assumption of treating behavioral measures as reflective of person-specific characteristics is that the measures of self-regulation are stable across time; in other words, that they have high test-retest reliability (high between-subject and low within-subject variability). This assumption has not been tested extensively, or equally for all measures of self-regulation, in the literature.  
Self-regulation measures can be grouped into two broad categories: surveys and cognitive tasks. While the former typically go through some form of psychometric testing, often filtering out items with low test-retest reliability, this is less frequently true for the latter. Though some empirical results on the test-retest reliability of certain cognitive task measures are reported in pockets of the literature, an exhaustive summary, or a systematic elimination of tasks or measures based on their stability across time, does not seem to exist. Test-retest reliability is, however, crucial for individual difference analyses (Hedge, Powell and Sumner, 2017).  
To answer the question of the reliability of self-regulation measures comprehensively, we created a large battery consisting of both cognitive task and survey measures. In this paper we make two contributions: first, we provide a review of these measures along with their reported reliabilities when available; second, we report the results of a large study in which we collected retest reliability data using all of these measures. This effort not only fills a notable gap in the literature by clarifying which measures of self-regulation are stable across time, but also provides guidance on factors to attend to when constructing new tasks for individual difference measurement.  
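The notion of reliability invoked above, high between-subject and low within-subject variability, can be made concrete with the intraclass correlation coefficient (ICC) used throughout this report. In one common formulation,

$$\mathrm{ICC} = \frac{\sigma^2_{\text{between}}}{\sigma^2_{\text{between}} + \sigma^2_{\text{within}}},$$

so a measure is reliable to the extent that stable differences between subjects account for more of the total variance than session-to-session fluctuations within subjects.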

# Methods

## Data collection

This sample consists of data from 150 subjects out of the original sample of 522 who completed the initial battery of 37 cognitive tasks, 23 surveys and 3 demographic surveys. Details of the original sample as well as the quality control (qc) procedures are described elsewhere (Eisenberg et al., 2017). Invited participants were chosen randomly, and only subsets of them were invited for a given batch (instead of opening the battery to all qualified subjects), with the intention of avoiding potential oversampling of, and bias towards, "high self-regulators".

```{r warning=FALSE, message=FALSE}
workers = read.csv(paste0(retest_data_path,'Local/User_717570_workers.csv'))
workers = workers %>% 
  group_by(Worker.ID) %>%
  mutate(Retest_worker=ifelse(sum(CURRENT.RetestWorker,CURRENT.RetestWorkerB2,CURRENT.RetestWorkerB3,CURRENT.RetestWorkerB4,CURRENT.RetestWorkerB5,na.rm=T)>0,1,0)) %>%
  ungroup()

worker_counts <- fromJSON(paste0(retest_data_path,'/Local/retest_worker_counts.json'))

worker_counts = as.data.frame(unlist(worker_counts))
names(worker_counts) = "task_count"
```

In total `r sum(workers$Retest_worker)` participants were invited, `r nrow(worker_counts)` began the battery, `r sum(worker_counts$task_count >= 62)` completed the battery, and 150 provided data that passed qc at both time points. Our target sample size was determined in advance, and data collection continued until this number of participants with data surviving qc was reached.

```{r warning=FALSE, message=FALSE}
disc_comp_date = read.csv(paste0(retest_data_path,'Local/discovery_completion_dates.csv'), header=FALSE)
val_comp_date = read.csv(paste0(retest_data_path,'Local/validation_completion_dates.csv'), header=FALSE)
test_comp_date = rbind(disc_comp_date, val_comp_date)
rm(disc_comp_date, val_comp_date)
retest_comp_date = read.csv(paste0(retest_data_path,'Local/retest_completion_dates.csv'), header=FALSE)
comp_dates = merge(retest_comp_date, test_comp_date, by="V1")
names(comp_dates) <- c("sub_id", "retest_comp", "test_comp")
comp_dates$retest_comp = as.Date(comp_dates$retest_comp)
comp_dates$test_comp = as.Date(comp_dates$test_comp)
comp_dates$days_btw = with(comp_dates, retest_comp-test_comp)
```

Data collection took place on average `r round(mean(as.numeric(comp_dates$days_btw)))` days after the completion of the initial battery, with a range of `r round(range(as.numeric(comp_dates$days_btw)))[1]` to `r round(range(as.numeric(comp_dates$days_btw)))[2]` days.

```{r}
rm(test_comp_date, retest_comp_date, comp_dates)
```

## Demographics

```{r}
test_demog <- read.csv(paste0(retest_data_path, '/t1_data/demographic_health.csv'))

retest_demog <- read.csv(paste0(retest_data_path, 'demographic_health.csv'))

retest_demog = retest_demog[retest_demog$X %in% test_demog$X,]

names(test_demog)[which(names(test_demog) == 'X')] <-'sub_id'
names(retest_demog)[which(names(retest_demog) == 'X')] <-'sub_id'

summary(test_demog %>%
          select(Sex, Age))

summary(retest_demog %>%
          select(Sex, Age))
```

## Literature

One of the major contributions of this project is a comprehensive literature review of the retest reliabilities of the surveys and tasks that were used. We reviewed the literature at the measure (as opposed to task) level, paying attention to differences in sample size, the delay between the two measurements, and the statistic used to assess reliability (e.g. Spearman vs. Pearson correlations). Here we present a table and a visualization summarizing our findings. References mentioned in the table below can be found [here](https://docs.google.com/spreadsheets/d/1hcED4_rWSGSE8Td0aqT0kWqlLCBAzRNKBXj3tSram58/edit?usp=sharing).  

```{r}
lit_review <- read.csv('/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/input/lit_review_figure.csv')

lit_review
```

```{r message=FALSE}
lit_review = lit_review %>%
  separate(dv, c("task_group", "var"), sep="\\.",remove=FALSE,extra="merge") %>%
  mutate(task_group = factor(task_group, levels = task_group[order(task)]),
         type = as.character(type)) %>%
  mutate(task_group = gsub("_", " ", task_group),
         var = gsub("_", " ", var)) %>%
  arrange(task_group, raw_fit, var) %>%
  mutate(task_group = ifelse(task_group == "psychological refractory period two choices", "psychological refractory period", ifelse(task_group == "angling risk task always sunny", "angling risk task",task_group))) %>%
  mutate(task_group = gsub("survey", "", task_group)) %>%
  select(-measure_description)


tmp = lit_review[duplicated(lit_review$reference)==FALSE,]

nrow(tmp)
sum(tmp$sample_size)
nrow(lit_review)
```

```{r}
rm(tmp)
```

Measure level plot

```{r warning=FALSE, message=FALSE}
lit_review = lit_review %>%
  select(-reference)
```

```{r warning= FALSE, message =FALSE, echo=FALSE}
p1_legend = lit_review %>%
  filter(task == 'task') %>%
ggplot(aes(y = var, x = retest_reliability)) + 
  geom_point(aes(size=sample_size, shape=type))+
  facet_grid(task_group~., switch = "y", scales = "free_y", space = "free_y") +
  theme(panel.spacing = unit(0.5, "lines"), 
        strip.placement = "outside",
        strip.text.y = element_text(angle=180),
        panel.background = element_rect(fill = NA),
        panel.grid.major = element_line(colour = "grey80"),
        legend.position = 'bottom') + 
  xlab("")+
  ylab("")+
  scale_x_continuous(limits = c(-0.25,1), breaks=c(-0.25, 0, 0.25, 0.5, 0.75, 1))+
  scale_shape_manual(breaks = sort(lit_review$type), values = c(15, 16, 17, 3))

p1 = lit_review %>%
  filter(task == 'task') %>%
ggplot(aes(y = var, x = retest_reliability)) + 
  geom_point(aes(size=sample_size, shape = type), color='#00BFC4')+
  facet_grid(task_group~., switch = "y", scales = "free_y", space = "free_y") +
  theme(panel.spacing = unit(0.5, "lines"), 
        strip.placement = "outside",
        strip.text.y = element_text(angle=180),
        panel.background = element_rect(fill = NA),
        panel.grid.major = element_line(colour = "grey80"),
        legend.position = 'bottom') + 
  xlab("")+
  ylab("")+
  scale_x_continuous(limits = c(-0.25,1), breaks=c(-0.25, 0, 0.25, 0.5, 0.75, 1))+
  scale_shape_manual(breaks = sort(lit_review$type), values = c(15, 16, 17, 3))

p2 = lit_review %>%
  filter(task == 'survey') %>%
ggplot(aes(y = var, x = retest_reliability)) + 
  geom_point(aes(size=sample_size, shape = type), color = '#F8766D')+
  facet_grid(task_group~., switch = "y", scales = "free_y", space = "free_y") +
  theme(panel.spacing = unit(0.5, "lines"), 
        strip.placement = "outside",
        strip.text.y = element_text(angle=180),
        panel.background = element_rect(fill = NA),
        panel.grid.major = element_line(colour = "grey80")) + 
  xlab("")+
  ylab("")+
  scale_x_continuous(limits = c(-0.25,1), breaks=c(-0.25, 0, 0.25, 0.5, 0.75, 1))+
  scale_shape_manual(breaks = sort(lit_review$type), values = c(15, 17, 3))

mylegend<-g_legend(p1_legend)

p3 <- arrangeGrob(arrangeGrob(p1 +theme(legend.position="none"),
                         p2 + theme(legend.position="none"),
                         nrow=1),
             mylegend, nrow=2,heights=c(10, 1))

ggsave('Lit_Review_Plot.jpg', plot = p3, device = "jpeg", path = "/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/figures/", width = 24, height = 20, units = "in", dpi=100)
rm(p1, p2, p3, p1_legend, mylegend)
```

```{r echo=FALSE, out.width='100%'}
knitr::include_graphics('/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/figures/Lit_Review_Plot.jpg')
```

Because this plot is difficult to digest, we summarize it at the task level to give a general sense of the main takeaways. This summary naturally disregards much of the fine-grained information.

```{r}
p1_t_legend <- lit_review %>%
  filter(task == 'task') %>%
ggplot(aes(y = factor(task_group, levels=rev(unique(task_group))), x = retest_reliability)) + 
  geom_point(aes(size=sample_size, shape = type), color='black')+
  theme(axis.text.y = element_text(size=43),
        legend.position = 'bottom',
        axis.text.x = element_text(size=23),
        axis.title.x = element_text(size=30), 
        legend.text = element_text(size=20), 
        legend.key.width = unit(0.75, "inches"), 
        legend.title = element_text(size=28),
        legend.spacing.x = unit(0.5, "inches")) + 
  guides(size = guide_legend(override.aes = list(size=c(9,18,28))),
         shape = guide_legend(override.aes = list(size=18)))+
  xlab("Retest Reliability")+
  ylab("")+
  scale_x_continuous(limits = c(-0.25,1), breaks=c(-0.25, 0, 0.25, 0.5, 0.75, 1))+
  scale_y_discrete(labels = function(x) str_wrap(x, width = 10))+
  scale_shape_manual(breaks = sort(lit_review$type), values = c(15, 16, 17, 3), name="Type")+
  scale_size_continuous(name = "Sample Size")+
  geom_vline(xintercept = 0, color = "red", size = 1)

p1_t <- lit_review %>%
  filter(task == 'task') %>%
ggplot(aes(y = factor(task_group, levels=rev(unique(task_group))), x = retest_reliability)) + 
  geom_point(aes(size=sample_size, shape = type), color='#00BFC4')+
  theme(axis.text.y = element_text(size=43),
        legend.position = 'none',
        axis.text.x = element_text(size=23),
        axis.title.x = element_text(size=30)) + 
  xlab("Retest Reliability")+
  ylab("")+
  scale_x_continuous(limits = c(-0.25,1), breaks=c(-0.25, 0, 0.25, 0.5, 0.75, 1))+
  scale_y_discrete(labels = function(x) str_wrap(x, width = 10))+
  scale_shape_manual(breaks = sort(lit_review$type), values = c(15, 16, 17, 3))+
  scale_size_continuous(range = c(5, 35))+
  geom_vline(xintercept = 0, color = "red", size = 1)

p2_t <- lit_review %>%
  filter(task == 'survey') %>%
ggplot(aes(y = factor(task_group, levels=rev(unique(task_group))), x = retest_reliability)) + 
  geom_point(aes(size=sample_size, shape = type), color='#F8766D')+
  theme(axis.text.y = element_text(size=43),
        legend.position = 'none',
        axis.text.x = element_text(size=23),
        axis.title.x = element_text(size=30)) + 
  xlab("Retest Reliability")+
  ylab("")+
  scale_x_continuous(limits = c(-0.25,1), breaks=c(-0.25, 0, 0.25, 0.5, 0.75, 1))+
  scale_y_discrete(labels = function(x) str_wrap(x, width = 10))+
  scale_shape_manual(breaks = sort(lit_review$type), values = c(15, 16, 17, 3))+
  scale_size_continuous(range = c(5, 35))+
  geom_vline(xintercept = 0, color = "red", size = 1)

mylegend<-g_legend(p1_t_legend)

p3_t <- arrangeGrob(arrangeGrob(p1_t, p2_t, nrow=1), mylegend, nrow=2,heights=c(10, 1))

ggsave('Lit_Review_Plot_t.jpg', plot = p3_t, device = "jpeg", path = "/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/figures/", width = 27, height = 48, units = "in", limitsize = FALSE, dpi = 72)
rm(p1_t, p2_t, p3_t, mylegend, p1_t_legend)
```

```{r echo=FALSE, out.width='100%'}
knitr::include_graphics('/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/figures/Lit_Review_Plot_t.jpg')
```

An interactive version of this plot can be found [here](https://zenkavi.github.io/SRO_Retest_Analyses/output/reports/Lit_Review_Figure.html).

Takeaways from this review are:  
- Survey measures have been reported to have higher reliability than task measures  
- Survey measures show less variability in reported reliability estimates than task measures  

## Loading datasets

The variables included in this report are:  
- meaningful variables (includes only HDDM parameters)  
- EZ diffusion parameters  
- Raw RT and Accuracy measures  
- Variables found in the literature (for comparison)  

```{r echo=FALSE}
#Get variables of interest from Ian's release
tmp1 <- read.csv(paste0(test_data_path,'meaningful_variables.csv'))
tmp2 <- read.csv(paste0(test_data_path,'meaningful_variables_noDDM.csv'))
tmp3 <- read.csv(paste0(test_data_path,'meaningful_variables_EZ.csv'))
retest_report_vars = c(names(tmp1), names(tmp2), names(tmp3))
retest_report_vars = unique(retest_report_vars)
lit_rev_vars = as.character(unique(lit_review$dv)[which(unique(lit_review$dv) %in% retest_report_vars == FALSE)])
retest_report_vars = c(retest_report_vars, lit_rev_vars)
rm(tmp1, tmp2, tmp3, lit_rev_vars)
```

### Load time 1 data
```{r}
test_data <- read.csv(paste0(retest_data_path,'t1_data/variables_exhaustive.csv'))

test_data <- test_data[,names(test_data) %in% retest_report_vars]

test_data$X <- as.character(test_data$X)
names(test_data)[which(names(test_data) == 'X')] <-'sub_id' 
```

For reference, here are the variables that are **not** included in the analyses in the remainder of this report because they were not of theoretical interest in the factor structure analyses of these data so far. These include: drift diffusion and other model parameters for specific conditions within a task; survey variables that are not part of the dependent variables for that survey in the literature; and demographics (these are saved for prediction analyses).

```{r}
test_data2 <- read.csv(paste0(retest_data_path, 't1_data/variables_exhaustive.csv'))

df <- data.frame(names(test_data2)[which(names(test_data2) %in% names(test_data) == FALSE)])
names(df) = c('vars')

df
```

```{r echo=FALSE}
rm(test_data2, df)
```

### Load time 2 data 
```{r}
retest_data <- read.csv(paste0(retest_data_path,'variables_exhaustive.csv'))

retest_data <- retest_data[,names(retest_data) %in% retest_report_vars]

retest_data$X <- as.character(retest_data$X)
names(retest_data)[which(names(retest_data) == 'X')] <-'sub_id' 
retest_data = retest_data[retest_data$sub_id %in% test_data$sub_id,]
```

### Replace HDDM parameters in t1 data

Since HDDM parameters depend on the sample on which they are fit, we refit the model on t1 data for the subjects who have t2 data. Here we replace the HDDM parameters in the current t1 dataset with these refitted values.

```{r}
hddm_refits <- read.csv(paste0(retest_data_path,'t1_data/hddm_refits_exhaustive.csv'))

hddm_refits = hddm_refits[,names(hddm_refits) %in% retest_report_vars]

hddm_refits$X <- as.character(hddm_refits$X)
names(hddm_refits)[which(names(hddm_refits) == 'X')] <-'sub_id' 

#For later comparison of whether fitting the DDM parameters on full or retest sample makes a big difference
test_data_full_sample_hddm <- test_data

#drop hddm columns from test_data
test_data = cbind(test_data$sub_id, test_data[,names(test_data) %in% names(hddm_refits) == FALSE])

#fix naming before merging
names(test_data)[which(names(test_data) == 'test_data$sub_id')] <-'sub_id'

#merge hddm refits to test data
test_data = merge(test_data, hddm_refits, by="sub_id")
```

# Results

## Data quality checks

### Demographics reliability

Point estimates of reliability for the demographic variables.

```{r}
numeric_cols = c()

for(i in 1:length(names(test_demog))){
  if(is.numeric(test_demog[,i])){
    numeric_cols <- c(numeric_cols, names(test_demog)[i])
  }
}

demog_rel_df <- data.frame(spearman = rep(NA, length(numeric_cols)),
                     icc = rep(NA, length(numeric_cols)),
                     pearson = rep(NA, length(numeric_cols)))

row.names(demog_rel_df) <- numeric_cols

for(i in 1:length(numeric_cols)){
  demog_rel_df[numeric_cols[i], 'spearman'] <- get_spearman(numeric_cols[i], t1_df = test_demog, t2_df = retest_demog) 
  demog_rel_df[numeric_cols[i], 'icc'] <- get_icc(numeric_cols[i], t1_df = test_demog, t2_df = retest_demog)
  demog_rel_df[numeric_cols[i], 'pearson'] <- get_pearson(numeric_cols[i], t1_df = test_demog, t2_df = retest_demog)
}

demog_rel_df = demog_rel_df %>%
  mutate(var = row.names(.)) %>%
  select(var, icc, spearman, pearson) %>%
  arrange(-icc)

demog_rel_df
```

```{r}
sjt.df(demog_rel_df %>% mutate_if(is.numeric, funs(round(., 3))), describe=F, hide.progress = TRUE, show.rownames = FALSE, file = "/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/tables/demog_rel_table.doc")
```


```{r}
summary(demog_rel_df)
```

```{r echo=FALSE}
rm(test_demog, retest_demog, demog_rel_df, numeric_cols)
```

### Effect of delays between the two measurements

Due to our data collection strategy we did not strictly control the delay between the two measurements, as standard psychometric studies of retest reliability might.

Ideally, an individual difference measure would remain stable regardless of the delay between measurements.

Since we had only two measurements, we could not directly test whether a measure becomes less reliable as the delay between the two time points grows (the average number of days between the two measurements is the same for all measures, since reliability is a measure-level metric while days between completion is a subject-level one).

So if one regresses the average difference score for each measure on the average delay between the two measurements, one regresses a vector of varying values on a vector of identical values. That is equivalent to a t test asking whether the mean of the varying column (the average difference score) differs from the unique value in the constant column (the average delay between the two time points). (This should hold exactly in theory; in practice, since the dv could not be calculated for some subjects for some measures, there can be more than one unique value for the average delay.) This analysis is therefore not meaningful.
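A toy illustration of this degeneracy (hypothetical numbers, not our data): when every measure shares the same average delay, `lm` has no variance to exploit for the delay term and the regression collapses to a one-sample test of the mean difference.

```{r}
# Hypothetical numbers: one mean difference score per measure, and the
# same average delay for every measure (a constant predictor).
set.seed(1)
mean_diffs <- rnorm(20)
mean_delay <- rep(70, 20)

mod <- lm(mean_diffs ~ mean_delay)
coef(mod)
# The slope for mean_delay is NA (the column is constant, so it is
# aliased with the intercept); the intercept is just the grand mean of
# mean_diffs, i.e., what a one-sample t test would evaluate.
```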

Instead of a summary metric like the retest reliability estimate for each measure, we can check whether the difference score distribution for each measure depends on the delay between the two measurements. Since the difference score distribution is at the subject level, we can check whether a subject's position in this distribution depends on their position in the distribution of days between completing the two tests.

Make a data frame with the difference between the two scores for each measure for each subject. Since the scores of different measures are on different scales, the difference scores are normalized for comparability (i.e. demeaned and divided by the sd of the difference score distribution). Note that the variance of the difference score distribution reflects the variance at both time points (their sum, minus twice their covariance). Normalization sets the mean of each difference score distribution to 0, which would mask any meaningful change between the two time points; the analysis here, however, does not interpret the means of the difference score distributions but only their relation to the days between completion. We check whether the variables show systematic differences between the two time points later.
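A minimal sketch with made-up scores of what this normalization does: the mean shift between sessions disappears, but the ordering of subjects' difference scores, which is what the delay analysis uses, is preserved.

```{r}
# Hypothetical scores: time 2 adds a systematic increase plus noise.
set.seed(2)
t1_scores <- rnorm(50, mean = 100, sd = 15)
t2_scores <- t1_scores + rnorm(50, mean = 5, sd = 10)

raw_diff    <- t2_scores - t1_scores
scaled_diff <- as.numeric(scale(raw_diff))  # demean, divide by sd

c(mean = mean(scaled_diff), sd = sd(scaled_diff))  # 0 and 1 by construction
# The ~5-point mean change is masked, but subject order is unchanged.
```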

Here we check whether the difference is larger the longer the delay.

```{r warning=FALSE}
numeric_cols = c()

for(i in 1:length(names(test_data))){
  if(is.numeric(test_data[,i]) & names(test_data)[i] %in% names(retest_data)){
    numeric_cols <- c(numeric_cols, names(test_data)[i])
  }
}

t1_2_difference = data.frame()

for(i in 1:length(numeric_cols)){
  tmp = match_t1_t2(numeric_cols[i],format='wide')
  tmp = tmp %>% 
  mutate(difference = scale(`2` - `1`))
  t1_2_difference = rbind(t1_2_difference, tmp)
}

t1_2_difference$difference = as.data.frame(t1_2_difference$difference)$V1

t1_2_difference = t1_2_difference %>% separate(dv, c("task", "dv2"), sep="\\.", remove=FALSE)
```

```{r echo=FALSE}
rm(tmp, i)
```

Add completion dates to this data frame.

```{r}
retest_task_comp_times = read.csv(paste0(retest_data_path, 'Local/retest_task_completion_times.csv'))
test_task_comp_times = read.csv(paste0(retest_data_path, 'Local/test_task_completion_times.csv'))
task_comp_times = merge(retest_task_comp_times, test_task_comp_times, by=c('worker_id','task'))
rm(retest_task_comp_times, test_task_comp_times)
task_comp_times = task_comp_times %>%
  select(-X.x, -X.y) %>%
  mutate(finish_day.x = as.Date(finish_day.x),
         finish_day.y = as.Date(finish_day.y),
         days_btw = finish_day.x-finish_day.y) %>%
  rename(sub_id=worker_id)
```

```{r warning=FALSE}
t1_2_difference = merge(t1_2_difference, task_comp_times[,c('sub_id', 'task','days_btw')], by=c('sub_id', 'task'))
```

```{r echo=FALSE}
rm(task_comp_times)
```

What does the distribution of differences look like? Below is the distribution of differences between the two time points for each measure.

```{r warning=FALSE, message=FALSE}
t1_2_difference %>%
  ggplot(aes(difference, alpha=dv))+
  geom_histogram(position='identity')+
  theme(legend.position = 'none')
```

What do the difference score distributions look like with respect to the days between completion?

```{r}
t1_2_difference %>%
  ggplot()+
  geom_smooth(aes(as.numeric(days_btw), abs(difference), group=factor(dv)), method='lm', se=FALSE)+
  geom_smooth(aes(as.numeric(days_btw), abs(difference)), method='lm', color = "black", se=FALSE)+
  theme(legend.title = element_blank())+
  xlab('Days between completion')+
  ylab('Absolute Scaled difference score')

```

To test whether the slope of the black line is significant we run a mixed effects model with a fixed effect for days between completion, a random slope for days between for each dv, and a random intercept for each dv.

Previously I used subjects as a random effect, but the days between the two time points for each measure depend on the subject id. What varies randomly is which dv's distribution of differences we relate to the days between the time points. So I changed the model to have a fixed effect for days between, a random slope for days between (dependent variables can be differentially sensitive to the effect of delay), and a random intercept for each dependent variable.

The significant (negative) fixed effect suggests that, on average, the longer the delay the smaller the difference.

```{r}
summary(lmerTest::lmer(abs(difference) ~ scale(days_btw)+(scale(days_btw) | dv), data=t1_2_difference)) 
```

But a single mixed effects model does not give a sense of the simple effects (for how many of the variables the effect of days between is significant, and in which direction). We can run the regression separately for each dv to see whether all difference score distributions are affected the same way by the days between completion.

```{r}
get_delay_effect = function(df){
  mod = lm(abs(difference) ~ scale(days_btw), data = df)
  out = data.frame(estimate=coef(summary(mod))["scale(days_btw)","Estimate"], pval=coef(summary(mod))["scale(days_btw)","Pr(>|t|)"])
  return(out)
}

sig_days_effect = t1_2_difference %>%
  group_by(dv) %>%
  do(get_delay_effect(.)) %>%
  filter(pval<0.05)

sig_days_effect
```

For visualization I used to summarize the difference scores by taking the average difference per task per subject and plotting that against the number of days between completion. This yields multiple points for each value on the x-axis, one average difference per task per subject. But variables within a task can move in different directions (e.g. if the patient proportion increases, the discount rate decreases), so averaging them is not a good idea.
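A toy example (made-up numbers) of why averaging within a task can be misleading when its variables move in opposite directions:

```{r}
# Hypothetical difference scores for two variables of the same task that
# change in opposite directions across sessions for three subjects.
d_patient_prop  <- c(0.8, 0.6, 0.9)     # increases at time 2
d_discount_rate <- c(-0.7, -0.6, -0.8)  # decreases for the same subjects

# Per-subject task averages hover near zero despite clear change in both.
rowMeans(cbind(d_patient_prop, d_discount_rate))
```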

Instead I now plot all the data grouped by dependent variable, colored by whether the difference score distribution for that variable has a significant slope when regressed on days between. The black line is the main effect from the large multilevel model (this is the same plot as above, colored by the significance of each dv's slope).

```{r}
t1_2_difference %>%
  mutate(sig_days_effect = ifelse(dv %in% sig_days_effect$dv, 1, 0))%>%
  arrange(-sig_days_effect) %>%
  ggplot()+
  stat_smooth (aes(as.numeric(days_btw), abs(difference), 
                  group=factor(dv, levels=unique(dv[order(sig_days_effect)]), ordered=TRUE), 
                  color=factor(sig_days_effect, levels = c(1,0))),geom="line", alpha=0.5, method='lm')+
  geom_smooth(aes(as.numeric(days_btw), abs(difference)), method='lm', color = "black", se=FALSE)+
  theme(legend.title = element_blank())+
  xlab('Days between completion')+
  ylab('Scaled difference score')+
  scale_color_discrete(breaks=c(0,1),
                       labels=c("NS", "Sig"))

ggsave('DaysBtwEffect.jpg', device = "jpeg", path = "/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/figures/", width = 6, height = 4, units = "in", limitsize = FALSE, dpi = 100)
```

Conclusion: The average difference between a measure's scores, if anything, decreases with increasing delay. This fixed effect masks the simple effects for each dv. Looked at individually, only 10 variables show a dependence on days between completion: 7 show decreasing differences with longer delays, while 3 show increasing differences.
This suggests that the lack of stricter control over the days between completion does not have a large effect on the stability of most measures.

Do the variables that show a significant dependence on the delay between completion times have systematically lower or higher reliability? Here we plot the reliability point estimate against the size of the delay effect (the coefficient from the per-dv regressions above). There are very few variables to compare, but regardless the result does not look conclusive or concerning.

```{r}
#Create df of point estimate reliabilities
rel_df <- data.frame(spearman = rep(NA, length(numeric_cols)),
                     icc = rep(NA, length(numeric_cols)),
                     pearson = rep(NA, length(numeric_cols)),
                     partial_eta_sq = rep(NA, length(numeric_cols)),
                     sem = rep(NA, length(numeric_cols)),
                     var_subs = rep(NA, length(numeric_cols)),
                     var_ind = rep(NA, length(numeric_cols)),
                     var_resid = rep(NA, length(numeric_cols)))

row.names(rel_df) <- numeric_cols

for(i in 1:length(numeric_cols)){
  rel_df[numeric_cols[i], 'spearman'] <- get_spearman(numeric_cols[i]) 
  rel_df[numeric_cols[i], 'icc'] <- get_icc(numeric_cols[i])
  rel_df[numeric_cols[i], 'pearson'] <- get_pearson(numeric_cols[i])
  rel_df[numeric_cols[i], 'partial_eta_sq'] <- get_partial_eta(numeric_cols[i])
  rel_df[numeric_cols[i], 'sem'] <- get_sem(numeric_cols[i])
  rel_df[numeric_cols[i], 'var_subs'] <- get_var_breakdown(numeric_cols[i])$subs
  rel_df[numeric_cols[i], 'var_ind'] <- get_var_breakdown(numeric_cols[i])$ind
  rel_df[numeric_cols[i], 'var_resid'] <- get_var_breakdown(numeric_cols[i])$resid
}

rel_df$dv = row.names(rel_df)
row.names(rel_df) = seq(1:nrow(rel_df))
rel_df$task = 'task'
rel_df[grep('survey', rel_df$dv), 'task'] = 'survey'
rel_df[grep('holt', rel_df$dv), 'task'] = "task"
rel_df = rel_df %>%
  select(dv, task, spearman, icc, pearson, partial_eta_sq, sem, var_subs, var_ind, var_resid)
# row.names(rel_df) = NULL
```

```{r}
sig_days_effect %>%
  left_join(rel_df, by='dv') %>%
  ggplot(aes(estimate, icc))+
  geom_point()
```

### Overlapping survey questions

Some surveys have overlapping questions. Do these correlate within and across sessions?

First determine the overlapping questions.

```{r}
tmp = read.csv(gzfile(paste0(retest_data_path, 'items.csv.gz')))
tmp = tmp %>% 
  filter(worker == 's005') %>% 
  select(item_ID, item_text) %>% 
  mutate(item_text = trimws(as.character(item_text))) %>%
  unite(item, c("item_ID", "item_text"), sep = "___")

comb = as.data.frame(t(combn(unique(tmp$item),2)))

duplicate_items = comb %>% 
  filter(grepl('dospert', V1)==FALSE) %>%
  filter(grepl('selection_optimization', V1)==FALSE) %>%
  filter(grepl('sensation_seeking', V1)==FALSE) %>%
  separate(V1, c("item1_ID", "item1_text"), sep="___") %>%
  separate(V2, c("item2_ID", "item2_text"), sep="___") %>%
  mutate(similarity = levenshteinSim(item1_text, item2_text)) %>%
  filter(similarity>0.8) %>%
  select(item1_ID, item2_ID, item1_text, item2_text)

duplicate_items
```

```{r warning=FALSE, message=FALSE}
#surveys to read in
extract_items = c('worker',unique(with(duplicate_items, c(item1_ID, item2_ID))))

#correlations to compute:
#item1_t1 - item2_t1, 
#item1_t2 - item2_t2, 
#item1_t1 - item2_t2, 
#item1_t2 - item2_t1

duplicate_items_data_t1 = read.csv(paste0(test_data_path, 'subject_x_items.csv'))
duplicate_items_data_t2 = read.csv(paste0(retest_data_path, 'subject_x_items.csv'))

duplicate_items_data_t1 = duplicate_items_data_t1 %>%
  filter(worker %in% duplicate_items_data_t2$worker) %>%
  select(extract_items)

duplicate_items_data_t2=duplicate_items_data_t2 %>%
  filter(worker %in% duplicate_items_data_t1$worker) %>%
  select(extract_items)

duplicate_items = duplicate_items %>%
  mutate(t1_t1_cor = NA,
         t2_t2_cor = NA,
         t1_t2_cor = NA,
         t2_t1_cor = NA,
         t1_t1_polycor = NA,
         t2_t2_polycor = NA,
         t1_t2_polycor = NA,
         t2_t1_polycor = NA)

for(i in 1:nrow(duplicate_items)){
  duplicate_items$t1_t1_cor[i] = abs(cor(duplicate_items_data_t1[,c(duplicate_items$item1_ID[i])],
                                  duplicate_items_data_t1[,c(duplicate_items$item2_ID[i])]))
  
  duplicate_items$t2_t2_cor[i] = abs(cor(duplicate_items_data_t2[,c(duplicate_items$item1_ID[i])],
                                  duplicate_items_data_t2[,c(duplicate_items$item2_ID[i])]))
  
  duplicate_items$t1_t2_cor[i] = abs(cor(duplicate_items_data_t1[,c(duplicate_items$item1_ID[i])],
                                  duplicate_items_data_t2[,c(duplicate_items$item2_ID[i])]))
  
  duplicate_items$t2_t1_cor[i] = abs(cor(duplicate_items_data_t2[,c(duplicate_items$item1_ID[i])],
                                  duplicate_items_data_t1[,c(duplicate_items$item2_ID[i])]))
  
  duplicate_items$t1_t1_polycor[i] = abs(polychoric(data.frame(duplicate_items_data_t1[,c(duplicate_items$item1_ID[i])],
                      duplicate_items_data_t1[,c(duplicate_items$item2_ID[i])]))$rho[2])

  duplicate_items$t2_t2_polycor[i] = abs(polychoric(data.frame(duplicate_items_data_t2[,c(duplicate_items$item1_ID[i])],
                      duplicate_items_data_t2[,c(duplicate_items$item2_ID[i])]))$rho[2])

  duplicate_items$t1_t2_polycor[i] = abs(polychoric(data.frame(duplicate_items_data_t1[,c(duplicate_items$item1_ID[i])],
                      duplicate_items_data_t2[,c(duplicate_items$item2_ID[i])]))$rho[2])

  duplicate_items$t2_t1_polycor[i] = abs(polychoric(data.frame(duplicate_items_data_t2[,c(duplicate_items$item1_ID[i])],
                      duplicate_items_data_t1[,c(duplicate_items$item2_ID[i])]))$rho[2])
}

duplicate_items
```

```{r}
summary(duplicate_items$t1_t1_polycor)
summary(duplicate_items$t2_t2_polycor)
summary(duplicate_items$t1_t2_polycor)
summary(duplicate_items$t2_t1_polycor)
```

```{r echo=FALSE}
rm(tmp, comb, sig_days_effect, t1_2_difference, duplicate_items, duplicate_items_data_t1, duplicate_items_data_t2, extract_items, get_delay_effect)
```

## Comparison to prior literature

Read in and process bootstrapped results.

```{r}
process_boot_df = function(df){
  df = df %>%
  drop_na() %>%
  mutate(dv = as.character(dv),
         icc = as.numeric(as.character(icc)),
         spearman = as.numeric(as.character(spearman)),
         pearson = as.numeric(as.character(pearson)),
         eta_sq = as.numeric(as.character(eta_sq)),
         sem = as.numeric(as.character(sem)),
         partial_eta_sq = as.numeric(as.character(partial_eta_sq)),
         omega_sq = as.numeric(as.character(omega_sq)),
         var_subs = as.numeric(as.character(var_subs)),
         var_ind = as.numeric(as.character(var_ind)),
         var_resid = as.numeric(as.character(var_resid)),
         F_time = as.numeric(as.character(F_time)),
         p_time = as.numeric(as.character(p_time)),
         df_time = as.numeric(as.character(df_time)),
         df_resid = as.numeric(as.character(df_resid)))
  return(df)} 

boot_df <- read.csv(gzfile(paste0(retest_data_path,'bootstrap_merged.csv.gz')))

boot_df = process_boot_df(boot_df)

boot_df = boot_df[boot_df$dv %in% retest_report_vars,]

# Check if you have all variables bootstrapped
# retest_report_vars[which(retest_report_vars %in% boot_df$dv==FALSE)]

# boot_df contains hddm parameters fit on the full sample in the t1 data;
# refits_bootstrap_merged.csv.gz contains bootstrapped reliabilities for the refitted parameters

refit_boot_df = read.csv(gzfile(paste0(retest_data_path,'refits_bootstrap_merged.csv.gz')))

refit_boot_df = process_boot_df(refit_boot_df)

fullfit_boot_df = boot_df[as.character(boot_df$dv) %in% unique(as.character(refit_boot_df$dv)),]

boot_df = boot_df[!as.character(boot_df$dv) %in% unique(as.character(refit_boot_df$dv)),]

boot_df = rbind(boot_df, refit_boot_df)

rm(refit_boot_df)
```

Summarize bootstrapped results and merge to lit review data

```{r message=FALSE}
var_boot_df = boot_df %>%
  group_by(dv) %>%
  summarise(mean_icc = mean(icc),
            mean_pearson = mean(pearson))

rel_comp = lit_review %>%
  left_join(var_boot_df, by = 'dv')
```

Here is what our data look like (583 data points for 171 measures):

```{r}
rel_comp
```

### Lit review results

Distribution of reliabilities, sample sizes and delays

```{r}
rel_comp %>% 
  select(dv, task, retest_reliability, sample_size, days) %>%
  filter(days < 3600) %>%
  gather(key, value, -dv, -task) %>%
  ggplot(aes(value, fill=task))+
  geom_density(alpha=0.5, position='identity')+
  facet_wrap(~key, scales='free')+
  theme(legend.title = element_blank())
```

```{r}
summary(rel_comp$sample_size[rel_comp$task == "survey"])
summary(rel_comp$sample_size[rel_comp$task == "task"])
```

Among studies in the literature that report retest reliability, samples are smaller for task measures than for survey measures.

```{r}
summary(lm(sample_size ~ task,rel_comp))
```

What predicts retest reliability in the literature? Candidate predictors: task type, sample size, and delay in days.

```{r}
mod1 = lmer(retest_reliability ~ task + (1|dv), rel_comp)
mod2 = lmer(retest_reliability ~ task + sample_size + (1|dv), rel_comp)

anova(mod1, mod2)
```

```{r}
mod3 = lmer(retest_reliability ~ task + sample_size + days+ (1|dv), rel_comp)

anova(mod2, mod3)
```

```{r}
summary(mod2)
```

Tasks have significantly lower reliability and reliability decreases with increasing sample size.

```{r}
rel_comp %>% 
  ggplot(aes(sample_size, retest_reliability, color=task))+
  geom_smooth(method='lm')+
  geom_point(alpha = 0.2)+
  theme(legend.title = element_blank())+
  xlab("Sample Size")+
  ylab("Retest reliability")
```

I used to compare the effect size of task type on reliability estimates in the literature versus in our results (looking only at the variables found in the literature) but removed this analysis because  
1. the estimates in the literature are not all the same statistic  
2. from our data I was using only ICCs  
3. I was sampling from our bootstrapped results as many data points per measure as exist in the literature, which did not seem like the best way to make our data comparable to the literature.

Despite these problems the main result was that the effect size was much larger in our dataset than in the literature; given the problems, though, the sampling analyses below are more informative.

We also checked whether our results diverge most from studies with smaller sample sizes. The squared difference between our mean estimate and the reliability reported in the literature decreases exponentially with sample size: the smaller the literature sample, the more its reliability estimate differs from our results. But this was a weak result because most studies in the literature have small samples, and these show both small and large deviations (the effects were not significant either).

### Direct relation to our results

Correlation between our mean estimates from bootstrapped samples and the literature review for task variables

```{r}
n_df = rel_comp %>% 
  group_by(dv) %>%
  tally()

lit_emp_cor = function(){
  
  boot_comp = data.frame()
  
  for(i in 1:length(unique(rel_comp$dv)) ){
    cur_dv = unique(rel_comp$dv)[i]
    n = n_df$n[n_df$dv == cur_dv]
    sample_df = boot_df %>% filter(dv == cur_dv)
    tmp = sample_n(sample_df, n)
    boot_comp = rbind(boot_comp, tmp)
  }  
  
  rm(cur_dv, n, sample_df, tmp)
  
  #check if cbind is ok
  # sum(boot_comp$dv == rel_comp$dv)
  #cbinding pearson because that is the most common metric in the lit
  rel_comp = cbind(rel_comp, boot_comp$pearson)
  #rename new column
  names(rel_comp)[which(names(rel_comp) == "boot_comp$pearson")] = "pearson"
  
  out = data.frame(task = NA, survey = NA)
  
  out$task = with(rel_comp %>% filter(task == "task"), cor(pearson, retest_reliability))
  
  out$survey = with(rel_comp %>% filter(task == "survey"), cor(pearson, retest_reliability))
  
  rel_comp = rel_comp[,-16]
  
  return(out)
}

lit_emp_cor_out = plyr::rdply(100, lit_emp_cor)

write.csv(lit_emp_cor_out,'/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/tables/lit_emp_cor_out.csv')

summary(lit_emp_cor_out)

```

### Noise ceiling

Model comparisons: predicting the reliabilities in the literature from a sample of our results versus from a sample of the literature itself.

Sample one row per measure out of the lit review, then compare:

- $r^2$ of `retest_reliability ~ lit_predictor` vs. $r^2$ of `retest_reliability ~ mean_pearson`
- coefficient of `lit_predictor` vs. coefficient of `mean_pearson`

```{r}
comp_lit_pred <- function(df){
  
  sample_from_dv <- function(df){
    if(nrow(df)>1){
      row_num = sample(1:nrow(df),1)
      sample_row = df[row_num,]
      df = df[-row_num,]
      df$lit_predictor = sample_row$retest_reliability
    }
    return(df)
  }
  
  sampled_df = df %>%
    group_by(dv) %>%
    do(sample_from_dv(.)) 
  
  mod_lit = lm(retest_reliability ~ lit_predictor+scale(sample_size)+task, data=sampled_df)
  mod_boot = lm(retest_reliability ~ mean_pearson+scale(sample_size)+task, data=sampled_df)
  
  out = data.frame(r2_lit = summary(mod_lit)$r.squared,
                   r2_boot = summary(mod_boot)$r.squared,
                   m_lit = coef(summary(mod_lit))["lit_predictor","Estimate"],
                   m_boot = coef(summary(mod_boot))["mean_pearson","Estimate"])
  
  return(out)
}

comp_lit_pred_out = plyr::rdply(1000, comp_lit_pred(rel_comp))
```


```{r}
tmp = comp_lit_pred_out %>%
  select(-.n) %>%
  gather(key, value) %>%
  separate(key, c("stat", "sample"), sep = "_")
 
tmp$stat = as.factor(tmp$stat)
levels(tmp$stat) <- c("coefficient", expression(r^"2"))


tmp %>%  
  ggplot(aes(value, fill=sample))+
  geom_density(alpha = 0.5, position='identity', color=NA)+
  facet_grid(.~stat, scales='free', labeller = label_parsed)+
  scale_fill_discrete(breaks=c("boot","lit"),
                       labels=c("Empirical", "Literature"),
                      name="Prediction")+
  xlab('')+
  ylab('')

ggsave('Lit_Noise_Ceiling.jpg', device = "jpeg", path = "/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/figures/", width = 10, height = 4, units = "in", limitsize = FALSE, dpi = 100)
```


```{r}
with(comp_lit_pred_out, t.test(r2_lit, r2_boot, paired=T))
```

```{r}
with(comp_lit_pred_out, t.test(m_lit, m_boot))
```

```{r echo=FALSE}
rm(tmp, rel_comp, mod1, mod2, mod3, n_df, i, lit_emp_cor, lit_emp_cor_out, comp_lit_pred, comp_lit_pred_out)
```

## Relationship between reliability metrics (point estimates)

Based on [Weir (2005)](https://pdfs.semanticscholar.org/d99a/790cce43f7f20d742f9d379b79de4f767740.pdf) ICC(3,k) does not take into account within-subject differences between the two time points (i.e. the fixed effect of time/systematic error). Thus, it is well approximated by Pearson's r and subject to similar criticisms. [Weir (2005)](https://pdfs.semanticscholar.org/d99a/790cce43f7f20d742f9d379b79de4f767740.pdf) suggests reporting at least this systematic-error effect size if one chooses to report ICC(3,k). Based on his conclusions here I report:  
- ICC(3,k): As Dave clarified, this ranges from -1/(number of repeated measures - 1) to 1, so in our case the range is [-1, 1]; larger values mean that a subject's two scores on a given measure are more similar to each other than they are to other people's scores. "ICC is reflective of the ability of a test to differentiate between different individuals."  
- partial $\eta^2$ for time ($SS_{time}/SS_{within}$): effect size of time  
- SEM ($\sqrt{MS_{error}}$): standard error of measurement; the smaller the better. It "quantifies the precision of individual scores on a test" and is not dependent on the sample in the way ICC is, since it does not depend on between-subject variability (at least in this formulation), but it is unit-dependent.
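A base-R sketch of how these quantities relate, using simulated two-session data (toy numbers, not the study's), computed from the standard subject x time sums-of-squares decomposition:

```{r}
# Simulated scores for n subjects at k = 2 time points: a stable subject
# effect, a constant time shift (systematic error), and noise.
set.seed(3)
n <- 100; k <- 2
subj <- rnorm(n)                                      # stable individual differences
wide <- cbind(t1 = subj + rnorm(n, sd = 0.5),
              t2 = subj + 0.3 + rnorm(n, sd = 0.5))   # 0.3 = time effect

grand   <- mean(wide)
ss_sub  <- k * sum((rowMeans(wide) - grand)^2)        # between-subject SS
ss_time <- n * sum((colMeans(wide) - grand)^2)        # systematic time SS
ss_err  <- sum((wide - grand)^2) - ss_sub - ss_time   # residual SS
ms_sub  <- ss_sub / (n - 1)
ms_err  <- ss_err / ((n - 1) * (k - 1))

icc3k  <- (ms_sub - ms_err) / ms_sub                  # ICC(3,k)
sem    <- sqrt(ms_err)                                # standard error of measurement
p_eta2 <- ss_time / (ss_time + ss_err)                # partial eta^2 for time
round(c(icc3k = icc3k, sem = sem, p_eta2 = p_eta2), 3)
```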

We calculated `r as.numeric(table(rel_df$task)[1])` measures for surveys and `r as.numeric(table(rel_df$task)[2])` measures for cognitive tasks.  

Though we primarily report ICCs as our metric of reliability, the results do not change depending on the metric chosen. Here we plot point estimates of three different reliability metrics against each other (ICC, Pearson, Spearman). The bootstrapped version looks essentially the same but the plots are busier due to the larger number of data points.

```{r}
table(rel_df$task)
```

```{r warning=FALSE, message=FALSE}
p1 = rel_df %>%
  ggplot(aes(spearman, icc, col=task))+
  geom_point()+
  theme_bw()+
  theme(legend.title = element_blank(),
        legend.position = "none")+
  geom_abline(intercept = 0, slope=1)

p2 = rel_df %>%
  ggplot(aes(pearson, icc, col=task))+
  geom_point()+
  theme_bw()+
  theme(legend.title = element_blank(),
        legend.position = "none")+
  geom_abline(intercept = 0, slope=1)

p3 = rel_df %>%
  ggplot(aes(pearson, spearman, col=task))+
  geom_point()+
  theme_bw()+
  theme(legend.title = element_blank(),
        legend.position = "none")+
  geom_abline(intercept = 0, slope=1)

grid.arrange(p1, p2, p3, nrow=1)

ggsave('Metric_Scatterplots.jpg', plot = grid.arrange(p1, p2, p3, nrow=1), device = "jpeg", path = "/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/figures/", width = 12, height = 4, units = "in", limitsize = FALSE, dpi = 100)
```

As the scatter plots show, the correlations between the different reliability metrics were very high.

```{r}
cor(rel_df[,c('spearman', 'icc', 'pearson')])
```

```{r echo=FALSE}
rm(p1,p2,p3)
```

Note: some variables have negative ICCs. This occurs when $MS_{error} > MS_{between}$; for these variables there is no positive relationship between the data at the two time points.
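A toy demonstration of that case, wrapping the ANOVA computation in a small helper (`icc3k_fun` is a hypothetical name for illustration): a constant shift between sessions still yields ICC(3,k) = 1, while two sessions with no positive relationship push $MS_{error}$ above $MS_{between}$ and the ICC below zero.

```{r}
# ICC(3,k) for two sessions via the subject x time ANOVA decomposition.
icc3k_fun <- function(s1, s2) {
  w <- cbind(s1, s2); n <- nrow(w); k <- 2
  grand   <- mean(w)
  ss_sub  <- k * sum((rowMeans(w) - grand)^2)
  ss_time <- n * sum((colMeans(w) - grand)^2)
  ss_err  <- sum((w - grand)^2) - ss_sub - ss_time
  ms_sub  <- ss_sub / (n - 1)
  ms_err  <- ss_err / ((n - 1) * (k - 1))
  (ms_sub - ms_err) / ms_sub
}

set.seed(4)
x <- rnorm(100)
icc3k_fun(x, x + 1)                      # 1: the shift is absorbed by the time effect
icc3k_fun(x, -x + rnorm(100, sd = 0.5))  # negative: MS_error > MS_between
```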

## Summary of all measure reliabilities

Summarized bootstrapped reliabilities

```{r message=FALSE, warning=FALSE}
boot_df %>%
  group_by(dv) %>%
  summarise(icc_median = quantile(icc, probs = 0.5),
            icc_2.5 = quantile(icc, probs = 0.025),
            icc_97.5 = quantile(icc, probs = 0.975),
            spearman_median = quantile(spearman, probs = 0.5),
            spearman_2.5 = quantile(spearman, probs = 0.025),
            spearman_97.5 = quantile(spearman, probs = 0.975)) %>%
  datatable() %>%
  formatRound(columns=c('icc_median', 'icc_2.5', 'icc_97.5', 'spearman_median', 'spearman_2.5', 'spearman_97.5'), digits=3)
```

```{r echo=FALSE}
measure_labels <- read.csv('/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/input/measure_labels.csv')
measure_labels = measure_labels %>% select(-measure_description)
# Check if there are any missing variables
# retest_report_vars[(retest_report_vars %in% measure_labels$dv == FALSE)]
# measure_labels$dv[(measure_labels$dv %in% retest_report_vars == FALSE)]
```

```{r warning=FALSE, message=FALSE}
# Df wrangling for plotting
tmp = measure_labels %>%
  mutate(dv = as.character(dv)) %>%
  left_join(boot_df[,c("dv", "icc", "spearman")], by = 'dv') 

tmp = tmp %>%
  separate(dv, c("task_group", "var"), sep="\\.",remove=FALSE,extra="merge") %>%
  mutate(task_group = factor(task_group, levels = task_group[order(task)])) %>%
  separate(var, c("var"), sep="\\.",remove=TRUE,extra="drop") %>%
  mutate(task_group = gsub("_", " ", task_group),
         var = gsub("_", " ", var)) %>%
  arrange(task_group, var)

tmp = tmp %>%
  left_join(rel_df[,c("dv", "icc")], by = "dv") %>%
  rename(icc = icc.x, point_est = icc.y)

#Manual correction
tmp = tmp %>%
  mutate(task = ifelse(task_group == 'holt laury survey', "task", as.character(task))) %>%
  mutate(task_group = ifelse(task_group == "psychological refractory period two choices", "psychological refractory period", ifelse(task_group == "angling risk task always sunny", "angling risk task",task_group))) %>%
  mutate(task_group = gsub("survey", "", task_group))
```

Variable level summary of bootstrapped reliabilities.

```{r warning=FALSE, message=FALSE} 
p4 <- tmp %>%
  filter(task == 'task',
         raw_fit == 'raw') %>%
ggplot(aes(y = var, x = icc)) + 
  geom_point(color = '#00BFC4')+
  geom_point(aes(y = var, x = point_est), color = "black")+
  facet_grid(task_group~., switch = "y", scales = "free_y", space = "free_y", labeller = label_wrap_gen(width=20)) +
  theme(panel.spacing = unit(0.75, "lines"), 
        strip.placement = "outside",
        strip.text.y = element_text(angle=180, size=36),
        axis.text.y = element_text(size=20),
        panel.background = element_rect(fill = NA),
        panel.grid.major = element_line(colour = "grey80")) + 
  xlab("")+
  ylab("")+
  scale_x_continuous(limits = c(-0.25,1), breaks=c(-0.25, 0, 0.25, 0.5, 0.75, 1))+
  scale_y_discrete(labels = function(x) str_wrap(x, width = 10))+
  geom_vline(xintercept = 0, color = "red", size = 1)

p5 <- tmp %>%
  filter(task == 'survey') %>%
ggplot(aes(y = var, x = icc)) + 
  geom_point(color = '#F8766D')+
  geom_point(aes(y = var, x = point_est), color = "black")+
  facet_grid(task_group~., switch = "y", scales = "free_y", space = "free_y", labeller = label_wrap_gen(width=20)) +
  theme(panel.spacing = unit(0.75, "lines"), 
        strip.placement = "outside",
        strip.text.y = element_text(angle=180, size=36),
        axis.text.y = element_text(size=20),
        panel.background = element_rect(fill = NA),
        panel.grid.major = element_line(colour = "grey80")) + 
  xlab("")+
  ylab("")+
  scale_x_continuous(limits = c(-0.25,1), breaks=c(-0.25, 0, 0.25, 0.5, 0.75, 1))+
  scale_y_discrete(labels = function(x) str_wrap(x, width = 10))+
  geom_vline(xintercept = 0, color = "red", size = 1)
  
p6 <- arrangeGrob(p4, p5,nrow=1)

ggsave('Bootstrap_Raw_Var_Plot.jpg', plot = p6, device = "jpeg", path = "/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/figures/", width = 36, height = 72, units = "in", limitsize = FALSE, dpi=50)

rm(p4, p5, p6)
```

```{r echo=FALSE, out.width='100%'}
knitr::include_graphics('/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/figures/Bootstrap_Raw_Var_Plot.jpg')
```

```{r message=FALSE, warning=FALSE}
p4_t <- tmp %>%
  filter(task == 'task') %>%
ggplot(aes(x = factor(task_group, levels=rev(unique(task_group))), y = icc)) + 
  geom_violin(fill='#00BFC4')+
  theme_bw() + 
  theme(axis.text.y = element_text(size=30))+
  xlab("")+
  ylab("")+
  scale_y_continuous(limits = c(-0.25,1), breaks=c(-0.25, 0, 0.25, 0.5, 0.75, 1))+
  scale_x_discrete(labels = function(x) str_wrap(x, width = 15))+
  geom_hline(yintercept = 0, color = "red", size = 1)+
  coord_flip()

p5_t <- tmp %>%
  filter(task == 'survey') %>%
ggplot(aes(x = factor(task_group, levels=rev(unique(task_group))), y = icc)) + 
  geom_violin(fill='#F8766D')+
  theme_bw() + 
  theme(axis.text.y = element_text(size=43))+
  xlab("")+
  ylab("")+
  scale_y_continuous(limits = c(-0.25,1), breaks=c(-0.25, 0, 0.25, 0.5, 0.75, 1))+
  scale_x_discrete(labels = function(x) str_wrap(x, width = 10))+
  geom_hline(yintercept = 0, color = "red", size = 1)+
  coord_flip()
  
p6_t <- arrangeGrob(p4_t, p5_t,nrow=1)

ggsave('Bootstrap_Poster_Plot_t.jpg', plot = p6_t, device = "jpeg", path = "../output/figures/", width = 27, height = 48, units = "in", limitsize = FALSE, dpi = 100)
```

```{r echo=FALSE, out.width='100%'}
knitr::include_graphics('/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/figures/Bootstrap_Poster_Plot_t.jpg')
```

```{r echo=FALSE}
rm(p4_t, p5_t, p6_t)
```

Example of the overlaying procedure.

```{r}
p1<- tmp %>%
  filter(grepl('selection_optimization', dv)) %>%
ggplot(aes(x = var, y = icc)) + 
  geom_violin(fill='#F8766D')+
  theme_bw() + 
  theme(axis.text.y = element_text(size=30))+
  xlab("")+
  ylab("")+
  scale_y_continuous(limits = c(-0.25,1), breaks=c(-0.25, 0, 0.25, 0.5, 0.75, 1))+
  scale_x_discrete(labels = function(x) str_wrap(x, width = 15))+
  geom_hline(yintercept = 0, color = "red", size = 1)+
  coord_flip()

p2<- tmp %>%
  filter(grepl('selection_optimization', dv)) %>%
ggplot(aes(x = factor(task_group, levels=rev(unique(task_group))), y = icc, group=dv)) + 
  geom_violin(fill='#F8766D', position = 'identity')+
  theme_bw() + 
  theme(axis.text.y = element_text(size=30))+
  xlab("")+
  ylab("")+
  scale_y_continuous(limits = c(-0.25,1), breaks=c(-0.25, 0, 0.25, 0.5, 0.75, 1))+
  scale_x_discrete(labels = function(x) str_wrap(x, width = 15))+
  geom_hline(yintercept = 0, color = "red", size = 1)+
  coord_flip()

p3<- tmp %>%
  filter(grepl('selection_optimization', dv)) %>%
ggplot(aes(x = factor(task_group, levels=rev(unique(task_group))), y = icc)) + 
  geom_violin(fill='#F8766D', position = 'identity')+
  theme_bw() + 
  theme(axis.text.y = element_text(size=30))+
  xlab("")+
  ylab("")+
  scale_y_continuous(limits = c(-0.25,1), breaks=c(-0.25, 0, 0.25, 0.5, 0.75, 1))+
  scale_x_discrete(labels = function(x) str_wrap(x, width = 15))+
  geom_hline(yintercept = 0, color = "red", size = 1)+
  coord_flip()

p4 <- arrangeGrob(p1,p2, p3,nrow=1)

ggsave('Bootstrap_Example_Plot_t.jpg', plot = p4, device = "jpeg", path = "../output/figures/", width = 41, height = 5, units = "in", limitsize = FALSE, dpi = 100)
```

```{r echo=FALSE, out.width='100%'}
knitr::include_graphics('/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/figures/Bootstrap_Example_Plot_t.jpg')
```

```{r}
rm(p1, p2, p3, p4)
```

## Survey vs Tasks

Comparison of survey measures to cognitive task measures in the bootstrapped results, using a multilevel model with random intercepts for each measure and a fixed effect of survey versus cognitive task measure.

```{r}
boot_df = boot_df %>%
    mutate(task = ifelse(grepl("survey",dv), "survey","task"),
           task = ifelse(grepl("holt",dv), "task", task))

boot_df %>%
  group_by(task) %>%
  summarise(icc_median = quantile(icc, probs = 0.5),
            icc_2.5 = quantile(icc, probs = 0.025),
            icc_97.5 = quantile(icc, probs = 0.975),
            spearman_median = quantile(spearman, probs = 0.5),
            spearman_2.5 = quantile(spearman, probs = 0.025),
            spearman_97.5 = quantile(spearman, probs = 0.975),
            num_vars = n()/1000) %>%
  datatable() %>%
  formatRound(columns=c('icc_median', 'icc_2.5', 'icc_97.5', 'spearman_median', 'spearman_2.5', 'spearman_97.5'), digits=3)
```


```{r}
summary(lmerTest::lmer(icc ~  task + (1|dv), boot_df))
```

### Variance breakdown

The quantitative explanation for the difference in reliability estimates between surveys and tasks, as recently detailed by Hedge et al. (2017), lies in the different sources of variance underlying these measures. Specifically, the ICC is calculated as the ratio of between-subjects variance to the sum of all sources of variance. Thus, measures with high between-subjects variance have high test-retest reliability. Intuitively, measures with high between-subjects variance are also better suited for individual difference analyses, as they capture the differences between the subjects in a sample.
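This decomposition can be sketched directly. The following is a minimal, hypothetical illustration (not the helper functions used elsewhere in this document) of obtaining an ICC of this form from the variance components of a crossed random-effects model, assuming `lme4` is available; the toy data are simulated.

```r
library(lme4)

# hypothetical toy data: 50 subjects measured at 2 sessions
set.seed(1)
toy <- data.frame(sub_id = rep(1:50, each = 2), time = rep(1:2, 50))
toy$score <- rnorm(50)[toy$sub_id] + rnorm(100, sd = 0.5)

# crossed random intercepts for subject and session
fit <- lmer(score ~ 1 + (1 | sub_id) + (1 | time), data = toy)
vc  <- as.data.frame(VarCorr(fit))
var_subs  <- vc$vcov[vc$grp == "sub_id"]   # between-subjects variance
var_ind   <- vc$vcov[vc$grp == "time"]     # between-sessions variance
var_resid <- vc$vcov[vc$grp == "Residual"] # error variance

# ICC: between-subjects variance over total variance
icc <- var_subs / (var_subs + var_ind + var_resid)
```

With these toy settings the between-subjects variance dominates the residual variance, so the ICC comes out high; shrinking the subject effect relative to the noise would lower it.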

Here we first plot the percentage of variance explained by each of the three sources of variance for the point estimates of measure reliabilities. The plot only includes raw measures (no DDM parameters), and the measures are ranked by the percentage of between-subjects variance for each task/survey (i.e. from the best to the worst individual difference measure for each task/survey). Then we test statistically whether the percentage of variance explained by these sources differs between tasks and surveys.

```{r warning=FALSE, message=FALSE}
tmp = rel_df %>%
  mutate(var_subs_pct = var_subs/(var_subs+var_ind+var_resid)*100,
         var_ind_pct = var_ind/(var_subs+var_ind+var_resid)*100, 
         var_resid_pct = var_resid/(var_subs+var_ind+var_resid)*100) %>%
  select(dv, task, var_subs_pct, var_ind_pct, var_resid_pct) %>%
  mutate(dv = factor(dv, levels = dv[order(task)])) %>%
  separate(dv, c("task_group", "var"), sep="\\.",remove=FALSE,extra="merge") %>%
  mutate(task_group = factor(task_group, levels = task_group[order(task)])) %>%
  arrange(task_group, var_subs_pct) %>%
  mutate(rank = row_number()) %>%
  arrange(task, task_group, rank) %>%
  gather(key, value, -dv, -task_group, -var, -task, -rank) %>%
  ungroup()%>%
  mutate(task_group = gsub("_", " ", task_group),
         var = gsub("_", " ", var)) %>%
  mutate(task_group = ifelse(task_group == "psychological refractory period two choices", "psychological refractory period", ifelse(task_group == "angling risk task always sunny", "angling risk task",task_group))) %>%
  mutate(task_group = gsub("survey", "", task_group)) %>%
  filter(task=="task",
         !grepl("EZ|hddm", dv))%>%
  arrange(task_group, rank)
labels = tmp %>%
  distinct(dv, .keep_all=T)

p1 <- tmp %>%
  ggplot(aes(x=factor(rank), y=value, fill=factor(key, levels = c("var_resid_pct", "var_ind_pct", "var_subs_pct"))))+
  geom_bar(stat='identity', alpha = 0.75, color='#00BFC4')+
  scale_x_discrete(breaks = labels$rank,
                       labels = labels$var)+
  coord_flip()+
  facet_grid(task_group~., switch = "y", scales = "free_y", space = "free_y") +
  theme(panel.spacing = unit(0.5, "lines"), 
        strip.placement = "outside",
        strip.text.y = element_text(angle=180),
        panel.background = element_rect(fill = NA),
        panel.grid.major = element_line(colour = "grey85"),
        legend.position = 'bottom')+
  theme(legend.title = element_blank())+
  scale_fill_manual(breaks = c("var_subs_pct", "var_ind_pct", "var_resid_pct"),
                      labels = c("Variance between individuals",
                                 "Variance between sessions",
                                 "Error variance"),
                  values=c("grey65", "grey45", "grey25"))+
  ylab("")+
  xlab("")

tmp = rel_df %>%
  mutate(var_subs_pct = var_subs/(var_subs+var_ind+var_resid)*100,
         var_ind_pct = var_ind/(var_subs+var_ind+var_resid)*100, 
         var_resid_pct = var_resid/(var_subs+var_ind+var_resid)*100) %>%
  select(dv, task, var_subs_pct, var_ind_pct, var_resid_pct) %>%
  mutate(dv = factor(dv, levels = dv[order(task)])) %>%
  separate(dv, c("task_group", "var"), sep="\\.",remove=FALSE,extra="merge") %>%
  mutate(task_group = factor(task_group, levels = task_group[order(task)])) %>%
  arrange(task_group, var_subs_pct) %>%
  mutate(rank = row_number()) %>%
  arrange(task, task_group, rank) %>%
  gather(key, value, -dv, -task_group, -var, -task, -rank) %>%
  ungroup()%>%
  mutate(task_group = gsub("_", " ", task_group),
         var = gsub("_", " ", var)) %>%
  mutate(task_group = ifelse(task_group == "psychological refractory period two choices", "psychological refractory period", ifelse(task_group == "angling risk task always sunny", "angling risk task",task_group))) %>%
  mutate(task_group = gsub("survey", "", task_group)) %>%
  filter(task=="survey")%>%
  arrange(task_group, rank)
labels = tmp %>%
  distinct(dv, .keep_all=T)

p2 <- tmp %>%
  ggplot(aes(x=factor(rank), y=value, fill=factor(key, levels = c("var_resid_pct", "var_ind_pct", "var_subs_pct"))))+
  geom_bar(stat='identity', alpha = 0.75)+
  geom_bar(stat='identity', color='#F8766D', show.legend=FALSE)+
  scale_x_discrete(breaks = labels$rank,
                       labels = labels$var)+
  coord_flip()+
  facet_grid(task_group~., switch = "y", scales = "free_y", space = "free_y") +
  theme(panel.spacing = unit(0.5, "lines"), 
        strip.placement = "outside",
        strip.text.y = element_text(angle=180),
        panel.background = element_rect(fill = NA),
        panel.grid.major = element_line(colour = "grey85"),
        legend.position = 'bottom')+
  theme(legend.title = element_blank())+
  scale_fill_manual(breaks = c("var_subs_pct", "var_ind_pct", "var_resid_pct"),
                      labels = c("Variance between individuals",
                                 "Variance between sessions",
                                 "Error variance"),
                  values=c("grey65", "grey45", "grey25"))+
  ylab("")+
  xlab("")

mylegend<-g_legend(p2)

p3 <- arrangeGrob(arrangeGrob(p1 +theme(legend.position="none"),
                         p2 + theme(legend.position="none"),
                         nrow=1),
             mylegend, nrow=2,heights=c(10, 1))

ggsave('Variance_Breakdown_Plot.jpg', plot = p3, device = "jpeg", path = "/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/figures", width = 24, height = 20, units = "in", dpi = 100)
rm(tmp, labels, p1, p2 , p3)
```

```{r echo=FALSE, out.width='100%'}
knitr::include_graphics('/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/figures/Variance_Breakdown_Plot.jpg')
```

Comparing the types of variance for survey vs task measures: survey measures have higher between-subjects variability.

Note: This analysis includes DDM variables too.

We run separate models for the different sources of variance because an interactive model with a variance type by task interaction seemed too complicated.

First we find that task measures have a smaller percentage of their overall variance explained by variability between subjects compared to survey measures.

```{r}
tmp = boot_df %>%
  mutate(var_subs_pct = var_subs/(var_subs+var_ind+var_resid)*100,
         var_ind_pct = var_ind/(var_subs+var_ind+var_resid)*100, 
         var_resid_pct = var_resid/(var_subs+var_ind+var_resid)*100) %>%
  select(dv, task, var_subs_pct, var_ind_pct, var_resid_pct) 

summary(lmerTest::lmer(var_subs_pct~task+(1|dv),tmp%>%select(-var_ind_pct,-var_resid_pct)))
```

We also find that a significantly larger percentage of the variance of task measures is explained by between-session variability. Larger between-session variability suggests systematic differences between the two sessions. Such systematic effects can be due to, e.g., learning effects, as explored later.

```{r}
summary(lmerTest::lmer(var_ind_pct~task+(1|dv),tmp%>%select(-var_subs_pct,-var_resid_pct)))
```

```{r}
summary(lmerTest::lmer(var_resid_pct~task+(1|dv),tmp%>%select(-var_subs_pct,-var_ind_pct)))
```

```{r}
tmp_save = tmp%>%
  gather(key, value, -dv, -task) %>%
  group_by(task, key) %>%
  summarise(median = median(value),
            sd = sd(value)) %>%
  mutate(key = ifelse(key == 'var_ind_pct', 'Between session variance', ifelse(key == 'var_subs_pct', 'Between subjects variance', ifelse(key == 'var_resid_pct', 'Residual variance',NA)))) %>%
  rename(Median = median, SD = sd) %>%
  arrange(task, key)

tmp_save
```

```{r}
sjt.df(tmp_save %>% mutate_if(is.numeric, funs(round(., 3))), describe=F, hide.progress = TRUE, show.rownames = FALSE, file = "/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/tables/var_breakdown_summary.doc")
```

Summarizing for clearer presentation. This graph currently uses the bootstrapped reliabilities and is therefore messier than it would be if only the point estimates were used.

```{r}
tmp %>%
  gather(key, value, -dv, -task) %>%
  group_by(task, key) %>%
  summarise(mean_pct = mean(value),
            sd_pct = sd(value),
            n = n()) %>%
  mutate(cvl = qt(0.025, n-1),
         cvu = qt(0.975, n-1),
         cil = mean_pct+(sd_pct*cvl)/sqrt(n),
         ciu = mean_pct+(sd_pct*cvu)/sqrt(n),
         sem_pct = sd_pct/sqrt(n)) %>%
  ggplot(aes(factor(key, levels = c("var_subs_pct", "var_ind_pct", "var_resid_pct"),
                    labels = c("Variance between individuals",
                               "Variance between sessions",
                               "Error variance")), mean_pct, color=task))+
  geom_point(position=position_dodge(width = 0.25), size = 4)+
  # geom_errorbar(aes(ymin=mean_pct-sd_pct, ymax=mean_pct+sd_pct), position=position_dodge(width = 0.25), width=0, size=2)+
  geom_errorbar(aes(ymin=cil, ymax=ciu), position=position_dodge(width = 0.25), width=0.25, size=2)+
  theme_bw()+
  xlab('')+
  ylab('Percent')+
  theme(legend.title = element_blank(),
        legend.text = element_text(size=14),
        legend.position = 'bottom',
        axis.text = element_text(size=12),
        axis.title.y = element_text(size=12))+
  scale_x_discrete(labels = function(x) str_wrap(x, width = 15))

ggsave('Variance_Breakdown_DotPlot.jpg', device = "jpeg", path = "/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/figures", width = 6, height = 5, units = "in", dpi = 100)
```

## Systematic effects between time points

### Anova time effects

The type of ICC we have chosen does not take within-subject (between-session/systematic) variance into account. This is why Weir (2005) recommends checking whether there is a significant change over time and examining the SEMs. These systematic effects could be meaningful and important to account for in some measures (e.g. task measures that show learning effects).

Had we chosen another kind of ICC that takes this source of variance into account (e.g. ICC(2,1) or ICC(2,k)), it could have suggested that tasks have lower reliability.
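To make the contrast concrete, here is a hedged sketch (toy simulated data, not the study's measures) of the Shrout and Fleiss mean-squares formulas: ICC(2,1) carries the session (systematic) variance in its denominator, while ICC(3,1) does not, so a systematic practice effect lowers only the former.

```r
set.seed(2)
n <- 40; k <- 2  # 40 subjects, 2 sessions
d <- data.frame(sub  = factor(rep(1:n, each = k)),
                sess = factor(rep(1:k, n)))
# subject effects plus a systematic practice effect at session 2
d$y <- rnorm(n)[as.integer(d$sub)] + ifelse(d$sess == 2, 0.5, 0) +
  rnorm(n * k, sd = 0.5)

# mean squares from a two-way (subjects x sessions) ANOVA
ms  <- summary(aov(y ~ sub + sess, data = d))[[1]][["Mean Sq"]]
MSB <- ms[1]; MSS <- ms[2]; MSE <- ms[3]  # subjects, sessions, residual

icc_3_1 <- (MSB - MSE) / (MSB + (k - 1) * MSE)
icc_2_1 <- (MSB - MSE) / (MSB + (k - 1) * MSE + k * (MSS - MSE) / n)
# the systematic session shift lowers ICC(2,1) relative to ICC(3,1)
```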

Doing a simple t-test on the difference scores alone would not be a very rigorous way of testing whether any change is meaningful, because two error-laden distributions from the two time points would be compared to each other. Fortunately, there are ways to take the error of both measurements into account.

To check whether a measure shows systematic differences between the two time points in a meaningful number of bootstrapped samples, we can check whether the effect of time is significant in each bootstrapped sample and flag variables for which more than 5% of the bootstrapped samples show significant time effects.

Another way might be to compute confidence intervals using SEMs, as described in the second half of Weir (2005), and to check what percentage of participants have scores that fall outside this range. I haven't pursued this for now.

Here we ask: Which variables have significant time effects in more than 5% of the bootstrapped samples?

23/74 survey measures
133/372 task measures

```{r}
boot_df %>%
  select(dv, p_time, task) %>%
  mutate(time_effect_sig = ifelse(p_time<0.05,1,0)) %>%
  group_by(dv)%>%
  summarise(pct_sig_time_effect = sum(time_effect_sig)/10,
            task = unique(task))%>%
  filter(pct_sig_time_effect>5) %>%
  arrange(task,-pct_sig_time_effect) %>%
  ungroup()%>%
  group_by(task) %>%
  summarise(count=n())
```

Are these significantly different between surveys and tasks? No.

```{r}
chisq.test(matrix(data = c(23,74-23,133, 372-133), nrow=2))
```

Are they predominantly in one or the other direction and does that differ depending on survey vs task?

```{r}
boot_df %>%
  select(dv, p_time, task, pearson) %>%
  mutate(time_effect_sig = ifelse(p_time<0.05,1,0)) %>%
  group_by(dv)%>%
  summarise(pct_sig_time_effect = sum(time_effect_sig)/10,
            task = unique(task),
            mean_pearson = mean(pearson))%>%
  filter(pct_sig_time_effect>5) %>%
  arrange(task,-pct_sig_time_effect) %>%
  arrange(mean_pearson) 
  
```

### SEM CI calculation

Step 1: T = Grand mean + ICC * (Subject score - Grand mean)
Step 2: SEP = SD of both measurements * sqrt(1-ICC^2)

Here I calculate the proportion of subjects that move more than one standard error of prediction (SEP) in t2 compared to t1 for each measure. 

This is odd to think about because the larger the ICC of a measure, the smaller the SEP, so very small differences between the two time points can be categorized as 'meaningful' based on a tiny SEP.

What you are interested in, though, is not necessarily whether individuals change at all between the two time points; you want to know whether the change is systematic in one direction.

Here I calculate the proportion of people showing 'meaningful' changes in one direction or the other. To integrate the two directions, I subtracted one from the other and filtered the variables for which more than 5% of the participants show meaningful change in one direction over the other (so if a variable has 10 participants showing a difference in one direction and 10 in the other, these cancel out, but a variable with 15 people showing a positive change and 5 showing a negative change would remain).


```{r}
get_ind_ci = function(dv_var){
  matched = match_t1_t2(dv_var)
  grand_mean = mean(matched$score)
  grand_sd = sd(matched$score)
  dv_icc = get_icc(dv_var)
  sep = grand_sd * sqrt(1-(dv_icc^2))
  matched = matched %>% 
    spread(time, score) %>%
    rename("t1"="1", "t2"="2") %>%
    mutate(true_score = grand_mean+(dv_icc*(t1-grand_mean)),
           ci_up = true_score+(1.96*sep),
           ci_low = true_score-(1.96*sep),
           t2_above_ci = ifelse(t2>ci_up,1,0),
           t2_below_ci = ifelse(t2<ci_low,1, 0))
    return(matched)
}

get_prop_out_ci = function(dv_var){
  get_ind_ci(dv_var) %>%
    summarise(prop_above_ci = sum(t2_above_ci)/n(),
              prop_below_ci = sum(t2_below_ci)/n())
}

#Create df of point estimate reliabilities
ind_ci_df <- data.frame(prop_above_ci = rep(NA, length(numeric_cols)),
                        prop_below_ci = rep(NA, length(numeric_cols)))

row.names(ind_ci_df) <- numeric_cols

for(i in 1:length(numeric_cols)){
    ind_ci_df[numeric_cols[i], 'prop_above_ci'] <- get_prop_out_ci(numeric_cols[i])$prop_above_ci 
    ind_ci_df[numeric_cols[i], 'prop_below_ci'] <- get_prop_out_ci(numeric_cols[i])$prop_below_ci 
}

ind_ci_df$dv = row.names(ind_ci_df)
row.names(ind_ci_df) = seq(1:nrow(ind_ci_df))
ind_ci_df$task = 'task'
ind_ci_df[grep('survey', ind_ci_df$dv), 'task'] = 'survey'
ind_ci_df[grep('holt', ind_ci_df$dv), 'task'] = "task"
ind_ci_df = ind_ci_df %>%
  select(dv, task, prop_above_ci, prop_below_ci)

mean_diff_df = ind_ci_df %>%
  mutate(prop_one_direction = prop_above_ci-prop_below_ci) %>%
  filter(abs(prop_one_direction)>0.05) %>%
  arrange(-abs(prop_one_direction))

mean_diff_df

# write.csv(mean_diff_df, '/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/tables/mean_diff_df.csv')
```

This doesn't make it easier to interpret whether the change is a performance improvement or decline. In fact, 'performance' isn't even necessarily the correct term here. Using the valence assignments of the measures, we instead look at whether the change translates to an increase or decrease in self-regulation. Of the 66 variables, this makes sense for 54.

```{r warning=FALSE, message=FALSE}
mean_diff_df = read.csv('/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/tables/mean_diff_df.csv')

table(as.character(mean_diff_df$valence), useNA = "ifany")

mean_diff_df %>%
  mutate(sc_increase = ifelse(prop_one_direction>0&valence=="Pos",1,ifelse(prop_one_direction<0&valence=="Neg",1,0)),
         sc_decrease = ifelse(prop_one_direction>0&valence=="Neg",1,ifelse(prop_one_direction<0&valence=="Pos",1,0))) %>%
summarise(sum_sc_increase = sum(sc_increase,na.rm=T),
          sum_sc_decrease = sum(sc_decrease,na.rm=T))

```

Whether variables show more than a 5% difference in one direction does not depend on whether the change translates to a self-control increase or decrease.

```{r}
chisq.test(matrix(data = c(23,54-23,31,54-31), nrow=2))
```

Do they differ depending on whether they are task or survey variables? No.

```{r}
chisq.test(matrix(data = c(5,74-5,61,372-61), nrow=2))
```

```{r}
rm(mean_diff_df)
```

## Task Reliabilities

Here we summarize the results at the task level to make them more digestible and easier to connect with the literature.

We reduce the list of task measures to one per task by averaging only the raw measures computed from all the trials in a task. We chose to reduce the information in this manner to avoid any bias stemming from the differential amounts of interest in, and procedures applied to, certain tasks over others (e.g. one task can have over 10 measures because it has multiple conditions and we have chosen to fit DDMs for specific conditions, while another might have only 2 due to our relative inexperience with or lack of interest in it). We also check whether the number of trials in a task has a significant effect on these average reliabilities of raw measures.

We filter out the DDM parameters and measures for specific contrasts. Note that this still leaves some tasks with measures that are model fits and/or based on specific conditions (because the current datasets do not include measures that are based on all the trials **and** are raw, though one could dive into variables_exhaustive for such measures; for example, the average reliability for Kirby is based on three discount rates for specific conditions). Here is the order of tasks by median reliability, sorted by ICC and then by Spearman's $\rho$.

```{r warning=FALSE, message=FALSE}
tmp = measure_labels %>%
  mutate(dv = as.character(dv)) %>%
  filter(task == 'task') %>%
  left_join(boot_df[,c("dv", "icc", "spearman")], by = 'dv') %>%
  filter(overall_difference != 'difference' & raw_fit %in% c('EZ', 'hddm') == FALSE)%>%
  separate(dv, c('task_name', 'extra_1', 'extra_2'), sep = '\\.',remove=FALSE) %>%
  select(-extra_1, -extra_2) %>%
  group_by(task_name) %>%
  summarise(median_icc = median(icc),
            median_spearman = median(spearman),
            icc_2.5 = quantile(icc, probs = 0.025),
            icc_97.5 = quantile(icc, probs = 0.975),
            min_spearman = min(spearman),
            max_spearman = max(spearman),
            num_measures = n()/1000,
            mean_num_trials = round(mean(num_all_trials)))%>%
  arrange(-median_icc, -median_spearman)

tmp %>%
  datatable() %>%
  formatRound(columns=c('median_spearman', 'median_icc',
                        'min_spearman', 'icc_2.5',
                        'max_spearman', 'icc_97.5'), digits=3)
```


```{r}
tmp = tmp%>%
  select(-median_spearman, -min_spearman, -max_spearman) %>%
  mutate(task_name = gsub("_", " ", task_name),
         task_name = gsub("(^|[[:space:]])([[:alpha:]])", "\\1\\U\\2", task_name, perl=TRUE))

names(tmp) = gsub("_", " ", names(tmp))
names(tmp) = gsub("(^|[[:space:]])([[:alpha:]])", "\\1\\U\\2", names(tmp), perl=TRUE)

sjt.df(tmp %>% mutate_if(is.numeric, funs(round(., 3))), describe=F, hide.progress = TRUE, show.rownames = FALSE, file = "/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/tables/task_rel_table.doc")
```

### Number of trials

Does the number of trials in a task have a significant effect on the average ICC of the (mostly) raw measures computed from all trials of a task? No. (There is no effect on Spearman's $\rho$ either.)

```{r warning=FALSE, message=FALSE}
tmp = measure_labels %>%
  mutate(dv = as.character(dv)) %>%
  filter(task == 'task') %>%
  # left_join(rel_df[,c("dv", "spearman","icc")], by='dv') %>%
  left_join(boot_df[,c("dv", "spearman","icc")], by='dv') %>%
  filter(overall_difference != 'difference' & raw_fit %in% c('EZ', 'hddm') == FALSE)%>%
  separate(dv, c('task_name', 'extra_1', 'extra_2'), sep = '\\.',remove=FALSE) %>%
  select(-extra_1, -extra_2)

# summary(lm(icc ~ num_all_trials, data = tmp))
summary(lmerTest::lmer(icc ~ num_all_trials + (1|dv), data = tmp))
```

```{r warning=FALSE, message=FALSE}
measure_labels %>%
  mutate(dv = as.character(dv)) %>%
  filter(task == 'task') %>%
  # left_join(rel_df[,c("dv", "spearman","icc")], by='dv') %>%
  left_join(boot_df[,c("dv", "spearman","icc")], by='dv') %>%
  filter(overall_difference != 'difference' & raw_fit %in% c('EZ', 'hddm') == FALSE)%>%
  separate(dv, c('task_name', 'extra_1', 'extra_2'), sep = '\\.',remove=FALSE) %>%
  select(-extra_1, -extra_2) %>%
  ggplot(aes(num_all_trials, icc))+
  geom_point()+
  geom_smooth(method="lm")+
  theme_bw()+
  xlab("Number of trials")+
  ylab("ICC")
```

#### Trial number dependence intrameasure 

The above analysis looked at the effect of the number of trials across tasks. But some tasks might be bad for individual difference measurement regardless of how many trials they contain, whereas for other tasks fewer trials might yield a sufficiently reliable measure.

For tasks for which dependent variables are estimated using many trials one can ask: Does the same measure get less reliable if fewer trials are used to estimate its reliability?

This won't make sense for all tasks. For example, to estimate a risk aversion parameter you need all of the Holt and Laury trials. For Kirby and Bickel you have specific conditions looking at subsets of trials. The Cognitive Reflection Task might be more appropriately analyzed item by item. The writing task does not have trial numbers. For all other tasks it might be interesting to investigate.

These kinds of analyses are too task-specific and in-depth for a paper that aims to give a global sense of how self-regulation measures differ in their suitability for individual difference analyses based on their stability across time. Such analyses would provide a detailed examination of how to extract the most reliable individual difference measure from a task that begins with a set of mediocre variables. Though we do not provide a comprehensive analysis of this sort in this paper, we present a single example of the approach and hope that the open access we provide to the data spurs further work.

For this example we look at the retest reliability of the three dependent measures (average accuracy, median response time, total score) from the hierarchical rule task with 360 trials. Here is a graph of how the point estimates of retest reliability change for each of the dependent measures when different numbers of trials are used to estimate them. While the reliability estimates for these variables are ..., ... and ... using all of the trials, these values are independent of the number of trials used in the task for only certain variables. Based on this analysis, researchers might decide to use a version of the task with fewer trials, or to use only measures with consistently high reliability estimates.

Example task: three by two

Post process dv's from cluster to calculate reliabilities
```{r}
t1_dvs = read.csv(paste0(retest_data_path, 't1_tbt_dvs.csv'))
t2_dvs = read.csv(paste0(retest_data_path, 't2_tbt_dvs.csv'))

t1_dvs = t1_dvs %>% select(-X)
t2_dvs = t2_dvs %>% select(-X)
```

```{r}
hr_merge = merge(t1_dvs, t2_dvs, by = c("sub_id", "breaks"))

hr_merge = hr_merge %>%
  gather(key, value, -sub_id, -breaks) %>%
  separate(key, c("dv", "time"), sep="\\.") %>%
  mutate(time = ifelse(time == "x", 1, 2))

t1_dvs = hr_merge %>%
  filter(time == 1) %>%
  select(-time) %>%
  spread(dv, value)

t2_dvs = hr_merge %>%
  filter(time == 2) %>%
  select(-time) %>%
  spread(dv, value)

# calculate point estimates for reliability of each of the variables for each break
# get_icc for each break of tmp_t1_dvs and tmp_t2_dvs

trial_num_rel_df = data.frame(breaks=rep(NA, length(unique(t1_dvs$breaks))),
                              acc_icc=rep(NA, length(unique(t1_dvs$breaks))),
                              avg_rt_error_icc=rep(NA, length(unique(t1_dvs$breaks))),
                              std_rt_error_icc=rep(NA, length(unique(t1_dvs$breaks))),
                              avg_rt_icc=rep(NA, length(unique(t1_dvs$breaks))),
                              std_rt_icc=rep(NA, length(unique(t1_dvs$breaks))),
                              missed_percent_icc=rep(NA, length(unique(t1_dvs$breaks))),
                              cue_switch_cost_rt_100_icc=rep(NA, length(unique(t1_dvs$breaks))),
                              cue_switch_cost_rt_900_icc=rep(NA, length(unique(t1_dvs$breaks))),
                              task_switch_cost_rt_100_icc=rep(NA, length(unique(t1_dvs$breaks))),
                              task_switch_cost_rt_900_icc=rep(NA, length(unique(t1_dvs$breaks))),
                              cue_switch_cost_acc_100_icc=rep(NA, length(unique(t1_dvs$breaks))),
                              cue_switch_cost_acc_900_icc=rep(NA, length(unique(t1_dvs$breaks))),
                              task_switch_cost_acc_100_icc=rep(NA, length(unique(t1_dvs$breaks))),
                              task_switch_cost_acc_900_icc=rep(NA, length(unique(t1_dvs$breaks))))

for(i in 1:length(unique(t1_dvs$breaks))){
  cur_break = unique(t1_dvs$breaks)[i]
  tmp_t1_dvs = t1_dvs %>% filter(breaks == cur_break)
  tmp_t2_dvs = t2_dvs %>% filter(breaks == cur_break)
  trial_num_rel_df$breaks[i] = cur_break
  trial_num_rel_df$acc_icc[i] = get_icc("acc", tmp_t1_dvs, tmp_t2_dvs)
  trial_num_rel_df$avg_rt_error_icc[i] = get_icc("avg_rt_error", tmp_t1_dvs, tmp_t2_dvs)
  trial_num_rel_df$std_rt_error_icc[i] = get_icc("std_rt_error", tmp_t1_dvs, tmp_t2_dvs)
  trial_num_rel_df$avg_rt_icc[i] = get_icc("avg_rt", tmp_t1_dvs, tmp_t2_dvs)
  trial_num_rel_df$std_rt_icc[i] = get_icc("std_rt", tmp_t1_dvs, tmp_t2_dvs)
  trial_num_rel_df$missed_percent_icc[i] = get_icc("missed_percent", tmp_t1_dvs, tmp_t2_dvs)
  trial_num_rel_df$cue_switch_cost_rt_100_icc[i] = get_icc("cue_switch_cost_rt_100", tmp_t1_dvs, tmp_t2_dvs)
  trial_num_rel_df$cue_switch_cost_rt_900_icc[i] = get_icc("cue_switch_cost_rt_900", tmp_t1_dvs, tmp_t2_dvs)
  trial_num_rel_df$task_switch_cost_rt_100_icc[i] = get_icc("task_switch_cost_rt_100", tmp_t1_dvs, tmp_t2_dvs)
  trial_num_rel_df$task_switch_cost_rt_900_icc[i] = get_icc("task_switch_cost_rt_900", tmp_t1_dvs, tmp_t2_dvs)
  trial_num_rel_df$cue_switch_cost_acc_100_icc[i] = get_icc("cue_switch_cost_acc_100", tmp_t1_dvs, tmp_t2_dvs)
  trial_num_rel_df$cue_switch_cost_acc_900_icc[i] = get_icc("cue_switch_cost_acc_900", tmp_t1_dvs, tmp_t2_dvs)
  trial_num_rel_df$task_switch_cost_acc_100_icc[i] = get_icc("task_switch_cost_acc_100", tmp_t1_dvs, tmp_t2_dvs)
  trial_num_rel_df$task_switch_cost_acc_900_icc[i] = get_icc("task_switch_cost_acc_900", tmp_t1_dvs, tmp_t2_dvs)
}
rm(i, cur_break, tmp_t1_dvs, tmp_t2_dvs)

trial_num_rel_df$breaks = as.numeric(trial_num_rel_df$breaks)

write.csv(trial_num_rel_df, paste0(retest_data_path, 'trial_num_rel_df_tbt.csv'))
```
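
`get_icc()` is defined earlier in this document. For reference, an ICC of this kind can be obtained from the variance components of a two-way random-effects model; the following is a hypothetical minimal sketch (not the actual helper), with subjects and sessions treated as random effects:

```r
library(lme4)

# Hypothetical ICC(2,1)-style sketch; NOT the get_icc() used in this document.
# t1_scores / t2_scores: one score per subject from each session, in the same order.
icc_sketch <- function(t1_scores, t2_scores) {
  long <- data.frame(
    score = c(t1_scores, t2_scores),
    sub = factor(rep(seq_along(t1_scores), 2)),
    session = factor(rep(1:2, each = length(t1_scores)))
  )
  m <- lmer(score ~ 1 + (1 | sub) + (1 | session), data = long)
  vc <- as.data.frame(VarCorr(m))
  v <- setNames(vc$vcov, vc$grp)
  # between-subject variance as a share of total variance
  unname(v["sub"] / (v["sub"] + v["session"] + v["Residual"]))
}
```

With only two sessions the session variance component is weakly identified, so a singular-fit message from `lmer` is common and harmless here.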

```{r warning=FALSE, message=FALSE}
# trial_num_rel_df = read.csv(paste0(retest_data_path, 'trial_num_rel_df_tbt.csv'))

cols <- c("Accuracy" = '#084594',
          "Average RT error" = '#99000d',
          "SD RT error" = '#cb181d',
          "Average RT correct" = '#ef3b2c',
          "SD RT correct" = '#fb6a4a',
          "Missed percentage" = '#2171b5',
          "Cue switch cost RT (100)" = '#fc9272',
          "Cue switch cost RT (900)" = '#fcbba1',
          "Task switch cost RT (100)" = '#fee0d2',
          "Task switch cost RT (900)" = '#fff5f0',
          "Cue switch cost Acc (100)" = '#4292c6',
          "Cue switch cost Acc (900)" = '#6baed6',
          "Task switch cost Acc (100)" = '#9ecae1',
          "Task switch cost Acc (900)" = '#c6dbef')

key_levels <- c("acc_icc", "avg_rt_error_icc", "std_rt_error_icc",
                "avg_rt_icc", "std_rt_icc", "missed_percent_icc",
                "cue_switch_cost_rt_100_icc", "cue_switch_cost_rt_900_icc",
                "task_switch_cost_rt_100_icc", "task_switch_cost_rt_900_icc",
                "cue_switch_cost_acc_100_icc", "cue_switch_cost_acc_900_icc",
                "task_switch_cost_acc_100_icc", "task_switch_cost_acc_900_icc")
key_labels <- c("Accuracy", "Average RT error", "SD RT error",
                "Average RT correct", "SD RT correct", "Missed percentage",
                "Cue switch cost RT (100)", "Cue switch cost RT (900)",
                "Task switch cost RT (100)", "Task switch cost RT (900)",
                "Cue switch cost Acc (100)", "Cue switch cost Acc (900)",
                "Task switch cost Acc (100)", "Task switch cost Acc (900)")

trial_num_rel_df %>%
  gather(key, value, -breaks) %>%
  ggplot(aes(breaks*10, value, color = factor(key, levels = key_levels, labels = key_labels)))+
  geom_point()+
  geom_line()+
  theme_bw()+
  xlab("Number of trials")+
  ylab("ICC")+
  theme(legend.title = element_blank(),
        legend.position = 'bottom')+
  scale_color_manual(values = cols,
                     breaks = c("Average RT error", "Accuracy",
                                "SD RT error", "Missed percentage",
                                "Average RT correct", "Cue switch cost Acc (100)",
                                "SD RT correct", "Cue switch cost Acc (900)",
                                "Cue switch cost RT (100)", "Task switch cost Acc (100)",
                                "Cue switch cost RT (900)", "Task switch cost Acc (900)",
                                "Task switch cost RT (100)", "Task switch cost RT (900)"))+
  ylim(-1,1)+
  guides(color = guide_legend(ncol=2, byrow=T))

ggsave('Intrameasure_Trialnum_Dependendence.jpg', device = "jpeg", path = "/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/figures", width = 10, height = 5, units = "in", dpi = 100)
```

### Raw vs DDM

Comparing raw measures to DDM parameters using the bootstrapped estimates. Variables computed over all trials are significantly more reliable than difference scores. Raw measures do not differ from DDM parameters overall. Which DDM fitting procedure is more reliable depends on whether the measure uses all trials or a difference score.
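
For context on the EZ comparisons below: the EZ-diffusion model (Wagenmakers et al., 2007) recovers drift rate, boundary separation, and non-decision time in closed form from a condition's proportion correct, correct-RT variance, and correct-RT mean. A minimal sketch of that mapping (the function name is ours; the edge corrections needed when Pc is 0, 0.5, or 1 are omitted):

```r
# Closed-form EZ-diffusion equations (Wagenmakers et al., 2007).
# pc: proportion correct; vrt: variance of correct RTs (s^2); mrt: mean correct RT (s).
# s is the conventional scaling parameter. pc in {0, 0.5, 1} is assumed not to occur.
ez_diffusion <- function(pc, vrt, mrt, s = 0.1) {
  L <- qlogis(pc)                                     # logit of accuracy
  x <- L * (L * pc^2 - L * pc + pc - 0.5) / vrt
  v <- sign(pc - 0.5) * s * x^(1/4)                   # drift rate
  a <- s^2 * L / v                                    # boundary separation (threshold)
  y <- -v * a / s^2
  mdt <- (a / (2 * v)) * (1 - exp(y)) / (1 + exp(y))  # mean decision time
  c(drift = v, thresh = a, non_decision = mrt - mdt)
}
```

On the worked example from the original paper (Pc = .802, VRT = .112, MRT = .723) this yields approximately a drift rate of .10, a threshold of .14, and a non-decision time of .30.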

```{r warning=FALSE, message=FALSE}
tmp = measure_labels %>%
  mutate(dv = as.character(dv)) %>%
  filter(ddm_task == 1, 
         overall_difference != "condition",
         rt_acc != 'other') %>%
  drop_na() %>%
  left_join(boot_df[,c("dv", "icc", "spearman")], by = 'dv')

tmp %>%
  drop_na() %>% #try removing this in final release
  group_by(overall_difference, raw_fit, rt_acc) %>%
  summarise(icc_median = quantile(icc, probs = 0.5),
            icc_2.5 = quantile(icc, probs = 0.025),
            icc_97.5 = quantile(icc, probs = 0.975),
            spearman_median = quantile(spearman, probs = 0.5),
            spearman_2.5 = quantile(spearman, probs = 0.025),
            spearman_97.5 = quantile(spearman, probs = 0.975),
            num_vars = n()/1000) %>% # each DV contributes 1000 bootstrap samples
  datatable() %>%
  formatRound(columns=c('icc_median', 'icc_2.5', 'icc_97.5', 'spearman_median', 'spearman_2.5', 'spearman_97.5'), digits=3)
```

```{r}
tmp_save = tmp %>%
  drop_na() %>% #try removing this in final release
  group_by(overall_difference, raw_fit, rt_acc) %>%
  summarise(icc_median = quantile(icc, probs = 0.5),
            icc_2.5 = quantile(icc, probs = 0.025),
            icc_97.5 = quantile(icc, probs = 0.975),
            num_vars = n()/1000) %>%
  ungroup() %>%
  mutate(overall_difference = as.character(overall_difference),
         raw_fit = as.character(raw_fit),
         rt_acc = as.character(rt_acc)) %>%
  arrange(-icc_median)

sjt.df(tmp_save %>% mutate_if(is.numeric, funs(round(., 3))), describe=F, hide.progress = TRUE, show.rownames = FALSE, file = "/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/tables/ddm_rel_table.doc")
```

Comparing overall measures vs. difference scores: overall measures have higher reliability than difference scores.

```{r}
summary(lmerTest::lmer(icc ~ overall_difference + (1|dv) ,tmp))
```

This is consistent with the summing of error variance in the difference score.
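
In classical-test-theory terms: for a difference score $D = X - Y$,

$$
\rho_{DD'} = \frac{\sigma_X^2\,\rho_{XX'} + \sigma_Y^2\,\rho_{YY'} - 2\,\rho_{XY}\,\sigma_X\sigma_Y}{\sigma_X^2 + \sigma_Y^2 - 2\,\rho_{XY}\,\sigma_X\sigma_Y},
$$

so even when $X$ and $Y$ are individually reliable, the high between-condition correlation $\rho_{XY}$ typical of two conditions of the same task shrinks the true-score variance in the numerator faster than the total variance in the denominator, lowering the reliability of the difference.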

```{r}
tmp = measure_labels %>%
  mutate(dv = as.character(dv)) %>%
  filter(ddm_task == 1, 
         overall_difference != "condition",
         rt_acc != 'other') %>%
  drop_na() %>%
  left_join(boot_df[,c("dv", "icc", "var_subs","var_ind", "var_resid")], by = 'dv') %>%
  mutate(var_subs_pct = var_subs/(var_subs+var_ind+var_resid),
         var_ind_pct = var_ind/(var_subs+var_ind+var_resid), 
         var_resid_pct = var_resid/(var_subs+var_ind+var_resid)) %>%
  select(-task, -ddm_task, -num_all_trials, -var_resid, -var_subs, -var_ind) %>%
  gather(key, value, -dv, -overall_difference, -raw_fit, -rt_acc, -icc)

tmp %>%
  filter(key == "var_resid_pct") %>%
  ggplot(aes(raw_fit, value, fill=overall_difference))+
  geom_boxplot()
  
summary(lmer(value  ~ overall_difference*raw_fit + (1|dv), data=tmp %>% filter(key=="var_resid_pct")))

#Simple effects
summary(lmer(value  ~ overall_difference + (1|dv), data=tmp %>% filter(key=="var_resid_pct", raw_fit== "EZ")))

summary(lmer(value  ~ overall_difference + (1|dv), data=tmp %>% filter(key=="var_resid_pct", raw_fit== "raw")))

summary(lmer(value  ~ overall_difference + (1|dv), data=tmp %>% filter(key=="var_resid_pct", raw_fit== "hddm")))
```

Checking Rogosa's claim that "difference scores are reliable when individual differences in true change exist."

How do we measure "individual differences in true change"?

1. Between-subject variability in the difference score distributions
2. The within-subject variance component of the ICC

We use option 2, because between-subject variances would be hard to compare across measures on different scales.

Expecting to see: a positive correlation between mean ICC and mean `var_ind_pct`.

Result: the correlation between mean ICC and mean `var_ind_pct` is NOT significant, BUT the distribution of mean `var_ind_pct` shows little variability to begin with.

```{r}
tmp = tmp %>%
  spread(key, value) %>%
  group_by(dv) %>%
  summarise(mean_var_ind_pct = mean(var_ind_pct),
            mean_var_subs_pct = mean(var_subs_pct),
            mean_icc = mean(icc)) %>%
  left_join(measure_labels, by='dv')

tmp %>%
  filter(overall_difference == 'difference') %>%
  ggplot(aes(mean_var_ind_pct, mean_icc))+
  geom_point()+
  geom_abline(slope=1, intercept=0)+
  geom_smooth(method = "lm")

summary(lm(mean_icc ~ mean_var_ind_pct,data=tmp %>%
  filter(overall_difference == 'difference')))

tmp %>%
  filter(overall_difference == 'difference') %>%
  ggplot(aes(mean_var_ind_pct))+
  geom_density()

summary(tmp$mean_var_ind_pct[tmp$overall_difference == "difference"])
summary(tmp$mean_var_subs_pct[tmp$overall_difference == "difference"])

tmp = tmp %>% filter(overall_difference == 'difference')
t.test(tmp$mean_var_ind_pct, tmp$mean_var_subs_pct, paired=T)
```

```{r}
tmp = measure_labels %>%
  mutate(dv = as.character(dv)) %>%
  filter(ddm_task == 1, 
         overall_difference != "condition",
         rt_acc != 'other') %>%
  drop_na() %>%
  left_join(boot_df[,c("dv", "icc", "spearman")], by = 'dv')
```

Comparing raw vs ddm in overall estimates: EZ is significantly better than HDDM and comparable to raw estimates.

```{r}
summary(lmerTest::lmer(icc ~ raw_fit + (1|dv) ,tmp %>% filter(overall_difference == "overall")))
```

Comparing raw vs ddm in difference scores: EZ is significantly worse than HDDM and comparable to raw estimates.

```{r}
summary(lmerTest::lmer(icc ~ raw_fit + (1|dv) ,tmp %>% filter(overall_difference == "difference")))
```

```{r}
tmp %>%
  ggplot(aes(factor(raw_fit, levels = c("raw", "EZ", "hddm"), labels=c("Raw", "EZ-diffusion", "Hierarchical diffusion")), icc, fill=factor(rt_acc, levels = c("rt","accuracy", "drift rate", "threshold", "non-decision"), labels=c("Response Time", "Accuracy","Drift Rate", "Threshold", "Non-decision"))))+
  geom_boxplot()+
  facet_wrap(~factor(overall_difference, levels=c("overall", "difference"), labels=c("Overall", "Difference")))+
  theme_bw()+
  ylab("ICC")+
  xlab("")+
  theme(legend.title = element_blank(),
        legend.position = 'bottom',
        legend.text = element_text(size=16),
        strip.text = element_text(size=16),
        axis.text = element_text(size = 16),
        text = element_text(size=16))+
  guides(fill = guide_legend(ncol = 2))+
  scale_fill_discrete(breaks=c("Drift Rate", "Threshold", "Non-decision", "Response Time", "Accuracy"))

ggsave('Bootstrap_DDM_Comp.jpg', device = "jpeg", path = "/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/figures/", width = 14, height = 10, units = "in", limitsize = FALSE)
```

```{r}
tmp2 = measure_labels %>%
  mutate(dv = as.character(dv)) %>%
  filter(ddm_task == 1, 
         overall_difference != "condition",
         rt_acc != "other") %>%
  drop_na() %>%
  left_join(rel_df[,c("dv", "icc", "spearman")], by = 'dv')

tmp2 %>%
  ggplot(aes(factor(raw_fit, levels = c("raw", "EZ", "hddm"), labels=c("Raw", "EZ-diffusion", "Hierarchical diffusion")), icc, fill=factor(rt_acc, levels = c("rt","accuracy", "drift rate", "threshold", "non-decision"), labels=c("Response Time", "Accuracy","Drift Rate", "Threshold", "Non-decision"))))+
  geom_boxplot()+
  facet_wrap(~factor(overall_difference, levels=c("overall", "difference"), labels=c("Overall", "Difference")))+
  theme_bw()+
  ylab("ICC")+
  xlab("")+
  theme(legend.title = element_blank(),
        legend.position = 'bottom',
        legend.text = element_text(size=16),
        strip.text = element_text(size=16),
        axis.text = element_text(size = 16),
        text = element_text(size=16))+
  guides(fill = guide_legend(ncol = 2))+
  scale_fill_discrete(breaks=c("Drift Rate", "Threshold", "Non-decision", "Response Time", "Accuracy"))

ggsave('PointEst_DDM_Comp.jpg', device = "jpeg", path = "/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/figures/", width = 14, height = 10, units = "in", limitsize = FALSE)
```

## Survey reliabilities

```{r warning=FALSE, message=FALSE}
tmp = measure_labels %>%
  mutate(dv = as.character(dv)) %>%
  filter(task == 'survey') %>%
  left_join(boot_df[,c("dv", "icc", "spearman")], by = 'dv') %>%
  filter(overall_difference != 'difference' & raw_fit %in% c('EZ', 'hddm') == FALSE)%>%
  separate(dv, c('task_name', 'extra_1', 'extra_2'), sep = '\\.',remove=FALSE) %>%
  select(-extra_1, -extra_2) %>%
  group_by(task_name) %>%
  summarise(median_icc = median(icc),
            median_spearman = median(spearman),
            icc_2.5 = quantile(icc, probs = 0.025),
            icc_97.5 = quantile(icc, probs = 0.975),
            min_spearman = min(spearman),
            max_spearman = max(spearman),
            num_measures = n()/1000,
            mean_num_trials = round(mean(num_all_trials)))%>%
  arrange(-median_icc, -median_spearman)

tmp %>%
  datatable() %>%
  formatRound(columns=c('median_spearman', 'median_icc',
                        'min_spearman', 'icc_2.5',
                        'max_spearman', 'icc_97.5'), digits=3)
```

```{r}
tmp = tmp%>%
  select(-min_spearman, -max_spearman,-median_spearman) %>%
  mutate(task_name = gsub("_", " ", task_name),
         task_name = gsub("(^|[[:space:]])([[:alpha:]])", "\\1\\U\\2", task_name, perl=TRUE),
         task_name = gsub("Survey", "", task_name))

names(tmp) = gsub("_", " ", names(tmp))
names(tmp) = gsub("(^|[[:space:]])([[:alpha:]])", "\\1\\U\\2", names(tmp), perl=TRUE)

sjt.df(tmp %>% mutate_if(is.numeric, funs(round(., 3))), describe=F, hide.progress = TRUE, show.rownames = FALSE, file = "/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/tables/survey_rel_table.doc")
```

### Number of items

Does the number of items in a survey have a significant effect on the ICC of survey measures? No.

```{r warning=FALSE, message=FALSE}
tmp = measure_labels %>%
  mutate(dv = as.character(dv)) %>%
  filter(task == 'survey') %>%
  # left_join(rel_df[,c("dv", "spearman","icc")], by='dv') %>%
  left_join(boot_df[,c("dv", "spearman","icc")], by='dv') %>%
  filter(overall_difference != 'difference' & raw_fit %in% c('EZ', 'hddm') == FALSE)%>%
  separate(dv, c('task_name', 'extra_1', 'extra_2'), sep = '\\.',remove=FALSE) %>%
  select(-extra_1, -extra_2)

# summary(lm(icc ~ num_all_trials, data = tmp))
summary(lmerTest::lmer(icc ~ num_all_trials + (1|dv), data = tmp))
```

```{r warning=FALSE, message=FALSE}
measure_labels %>%
  mutate(dv = as.character(dv)) %>%
  filter(task == 'task') %>%
  # left_join(rel_df[,c("dv", "spearman","icc")], by='dv') %>%
  left_join(boot_df[,c("dv", "spearman","icc")], by='dv') %>%
  filter(overall_difference != 'difference' & raw_fit %in% c('EZ', 'hddm') == FALSE)%>%
  separate(dv, c('task_name', 'extra_1', 'extra_2'), sep = '\\.',remove=FALSE) %>%
  select(-extra_1, -extra_2) %>%
  ggplot(aes(num_all_trials, icc))+
  geom_point()+
  geom_smooth(method="lm")+
  theme_bw()+
  xlab("Number of trials")+
  ylab("ICC")
```
